* [PATCH 0/4] fix depvpts in user namespaces @ 2013-03-15 9:13 Glauber Costa [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 9:13 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman Hi, devpts mounts in user namespaces is queued for 3.9. However, while playing with it I found it to be less than ideal. Although it could possibly work with custom software that can be made to point to /dev/pts/ptmx, a few things prevent it from working correctly for people that, like us, are booting full distributions. In those scenarios, things like udev will kick in, maybe remount /dev undoing any setup we might have done, and then software like sshd or anything else calling openpty will search for /dev/ptmx, not /dev/pts/ptmx. One of the problems that I am addressing in here is that we are disallowing mknod in usernamespaces. Although I understand the motivation for that, I believe that to be too restrictive, specially because we already control access to the files separately. There should be no harm in mknod'ing something per se, if manipulating it is forbidden. That too, however, is too restrictive. Following the precedence that we set by letting memcg manage the memory for tmpfs mounts, I am doing the same here with the device cgroup. With the exception that instead of suggesting, here we have a way to actually enforce it. Unless the mount was specifically marked as nodev, reads and writes will be allowed to proceed if a device cgroup is containing the process. The device cgroup will then be the one responsible for setting fine grained access about which devices can and cannot be manipulated. Last, /dev/ptmx will still always be the global ptmx device. We need to somehow link it to our namespaces'. My proposal is to multiplex it and return the correct "root ptmx" depending on which userns is reading that device. Glauber Costa (4): dev_cgroup: keep track of which cgroup is the root cgroup fs: allow dev accesses in userns in controlled situations fs: allow mknod in user namespaces devpts: fix usage in user namespaces fs/devpts/inode.c | 157 +++++++++++++++++++++++++++++++++++++++++++++-- fs/namei.c | 6 +- fs/namespace.c | 2 +- include/linux/mount.h | 2 + include/linux/security.h | 1 + security/device_cgroup.c | 15 ++++- 6 files changed, 173 insertions(+), 10 deletions(-) -- 1.8.1.2 ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 9:13 ` Glauber Costa [not found] ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 9:13 ` [PATCH 2/4] fs: allow dev accesses in userns in controlled situations Glauber Costa ` (3 subsequent siblings) 4 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 9:13 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman Most of the other subsystems already keep track of that in some way. We will do that internally and provide a test to determine whether or not our task is in a device cgroup that is not the root one. We can relax some of our checks in that case, trusting that whoever set device cgroup rules will be responsible to control access to their devices. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> --- include/linux/security.h | 1 + security/device_cgroup.c | 15 +++++++++++++-- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/include/linux/security.h b/include/linux/security.h index eee7478..fe58f71 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p); extern int cap_task_setioprio(struct task_struct *p, int ioprio); extern int cap_task_setnice(struct task_struct *p, int nice); extern int cap_vm_enough_memory(struct mm_struct *mm, long pages); +bool *task_in_child_devcgroup(struct task_struct *task); struct msghdr; struct sk_buff; diff --git a/security/device_cgroup.c b/security/device_cgroup.c index 1c69e38..03df5b2 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task) return css_to_devcgroup(task_subsys_state(task, devices_subsys_id)); } +static struct dev_cgroup *root_devcgroup; +bool task_in_child_devcgroup(struct task_struct *task) +{ + bool ret; + rcu_read_lock(); + ret = task_devcgroup(task) != root_devcgroup; + rcu_read_unlock(); + return ret; +} + struct cgroup_subsys devices_subsys; static int devcgroup_can_attach(struct cgroup *new_cgrp, @@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup) INIT_LIST_HEAD(&dev_cgroup->exceptions); parent_cgroup = cgroup->parent; - if (parent_cgroup == NULL) + if (parent_cgroup == NULL) { dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW; - else { + root_devcgroup = dev_cgroup; + } else { parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup); mutex_lock(&devcgroup_mutex); ret = dev_exceptions_copy(&dev_cgroup->exceptions, -- 1.8.1.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
[parent not found: <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup [not found] ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 14:07 ` Serge Hallyn 2013-03-15 14:43 ` Glauber Costa 2013-03-15 19:27 ` Aristeu Rozanski 1 sibling, 1 reply; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 14:07 UTC (permalink / raw) To: Glauber Costa Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, Li Zefan Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > Most of the other subsystems already keep track of that in some way. We > will do that internally and provide a test to determine whether or not > our task is in a device cgroup that is not the root one. We can relax > some of our checks in that case, trusting that whoever set device cgroup > rules will be responsible to control access to their devices. > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Patch looks fine. AFAIK we're still waiting on Aristeu's patchset to hit upstream. As your patches are simpler I'd prefer, if there is churn, for yours to be refactored than his. Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> > Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> > --- > include/linux/security.h | 1 + > security/device_cgroup.c | 15 +++++++++++++-- > 2 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/include/linux/security.h b/include/linux/security.h > index eee7478..fe58f71 100644 > --- a/include/linux/security.h > +++ b/include/linux/security.h > @@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p); > extern int cap_task_setioprio(struct task_struct *p, int ioprio); > extern int cap_task_setnice(struct task_struct *p, int nice); > extern int cap_vm_enough_memory(struct mm_struct *mm, long pages); > +bool *task_in_child_devcgroup(struct task_struct *task); > > struct msghdr; > struct sk_buff; > diff --git a/security/device_cgroup.c b/security/device_cgroup.c > index 1c69e38..03df5b2 100644 > --- a/security/device_cgroup.c > +++ b/security/device_cgroup.c > @@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task) > return css_to_devcgroup(task_subsys_state(task, devices_subsys_id)); > } > > +static struct dev_cgroup *root_devcgroup; > +bool task_in_child_devcgroup(struct task_struct *task) > +{ > + bool ret; > + rcu_read_lock(); > + ret = task_devcgroup(task) != root_devcgroup; > + rcu_read_unlock(); > + return ret; > +} > + > struct cgroup_subsys devices_subsys; > > static int devcgroup_can_attach(struct cgroup *new_cgrp, > @@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup) > INIT_LIST_HEAD(&dev_cgroup->exceptions); > parent_cgroup = cgroup->parent; > > - if (parent_cgroup == NULL) > + if (parent_cgroup == NULL) { > dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW; > - else { > + root_devcgroup = dev_cgroup; > + } else { > parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup); > mutex_lock(&devcgroup_mutex); > ret = dev_exceptions_copy(&dev_cgroup->exceptions, > -- > 1.8.1.2 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup 2013-03-15 14:07 ` Serge Hallyn @ 2013-03-15 14:43 ` Glauber Costa 2013-03-15 14:55 ` Serge Hallyn 0 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 14:43 UTC (permalink / raw) To: Serge Hallyn Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, Li Zefan On 03/15/2013 06:07 PM, Serge Hallyn wrote: > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): >> Most of the other subsystems already keep track of that in some way. We >> will do that internally and provide a test to determine whether or not >> our task is in a device cgroup that is not the root one. We can relax >> some of our checks in that case, trusting that whoever set device cgroup >> rules will be responsible to control access to their devices. >> >> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> >> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> >> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > > Patch looks fine. AFAIK we're still waiting on Aristeu's patchset to > hit upstream. As your patches are simpler I'd prefer, if there is > churn, for yours to be refactored than his. > I have no problem with that. There is also a small build issue here that needs to be fixed. If you allow me, with my guarantees that the patch will be preserved in spirit I will keep your ack after any refactoring =) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup 2013-03-15 14:43 ` Glauber Costa @ 2013-03-15 14:55 ` Serge Hallyn 0 siblings, 0 replies; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 14:55 UTC (permalink / raw) To: Glauber Costa Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman, Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski, Li Zefan Quoting Glauber Costa (glommer@parallels.com): > On 03/15/2013 06:07 PM, Serge Hallyn wrote: > > Quoting Glauber Costa (glommer@parallels.com): > >> Most of the other subsystems already keep track of that in some way. We > >> will do that internally and provide a test to determine whether or not > >> our task is in a device cgroup that is not the root one. We can relax > >> some of our checks in that case, trusting that whoever set device cgroup > >> rules will be responsible to control access to their devices. > >> > >> Signed-off-by: Glauber Costa <glommer@parallels.com> > >> Cc: Aristeu Rozanski <aris@redhat.com> > >> Cc: Eric Biederman <ebiederm@xmission.com> > >> Cc: Serge Hallyn <serge.hallyn@canonical.com> > > > > Patch looks fine. AFAIK we're still waiting on Aristeu's patchset to > > hit upstream. As your patches are simpler I'd prefer, if there is > > churn, for yours to be refactored than his. > > > I have no problem with that. There is also a small build issue here that > needs to be fixed. If you allow me, with my guarantees that the patch > will be preserved in spirit I will keep your ack after any refactoring =) > Of course :) Please keep my ack on it so long as you don't feel any substantive changes were needed :) -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup [not found] ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 14:07 ` Serge Hallyn @ 2013-03-15 19:27 ` Aristeu Rozanski 1 sibling, 0 replies; 28+ messages in thread From: Aristeu Rozanski @ 2013-03-15 19:27 UTC (permalink / raw) To: Glauber Costa Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman On Fri, Mar 15, 2013 at 01:13:40PM +0400, Glauber Costa wrote: > Most of the other subsystems already keep track of that in some way. We > will do that internally and provide a test to determine whether or not > our task is in a device cgroup that is not the root one. We can relax > some of our checks in that case, trusting that whoever set device cgroup > rules will be responsible to control access to their devices. > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> > --- > include/linux/security.h | 1 + > security/device_cgroup.c | 15 +++++++++++++-- > 2 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/include/linux/security.h b/include/linux/security.h > index eee7478..fe58f71 100644 > --- a/include/linux/security.h > +++ b/include/linux/security.h > @@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p); > extern int cap_task_setioprio(struct task_struct *p, int ioprio); > extern int cap_task_setnice(struct task_struct *p, int nice); > extern int cap_vm_enough_memory(struct mm_struct *mm, long pages); > +bool *task_in_child_devcgroup(struct task_struct *task); > > struct msghdr; > struct sk_buff; > diff --git a/security/device_cgroup.c b/security/device_cgroup.c > index 1c69e38..03df5b2 100644 > --- a/security/device_cgroup.c > +++ b/security/device_cgroup.c > @@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task) > return css_to_devcgroup(task_subsys_state(task, devices_subsys_id)); > } > > +static struct dev_cgroup *root_devcgroup; > +bool task_in_child_devcgroup(struct task_struct *task) > +{ > + bool ret; > + rcu_read_lock(); > + ret = task_devcgroup(task) != root_devcgroup; > + rcu_read_unlock(); > + return ret; > +} > + > struct cgroup_subsys devices_subsys; > > static int devcgroup_can_attach(struct cgroup *new_cgrp, > @@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup) > INIT_LIST_HEAD(&dev_cgroup->exceptions); > parent_cgroup = cgroup->parent; > > - if (parent_cgroup == NULL) > + if (parent_cgroup == NULL) { > dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW; > - else { > + root_devcgroup = dev_cgroup; > + } else { > parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup); > mutex_lock(&devcgroup_mutex); > ret = dev_exceptions_copy(&dev_cgroup->exceptions, patch looks good Acked-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> -- Aristeu ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/4] fs: allow dev accesses in userns in controlled situations [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 9:13 ` [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup Glauber Costa @ 2013-03-15 9:13 ` Glauber Costa 2013-03-15 14:20 ` Serge Hallyn 2013-03-15 9:13 ` [PATCH 3/4] fs: allow mknod in user namespaces Glauber Costa ` (2 subsequent siblings) 4 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 9:13 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman So far, unless the filesystem explicitly marks it (and most don't), processes running in user namespaces won't be allowed to access any devices. Although this makes sense, this is a quite restrictive rule, since a lot of those accesses would be perfectly safe: aside from the simple char devices in /dev/ like null, zero, etc, it is perfectly possible to assign a device for usage inside a namespace if we can establish trust in that operation. We will do that by marking the mount as MNT_NODEV_NS instead of MNT_NODEV. This is because if the mount operation explicitly asked for nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the task is not on a device cgroup. If it is, we will rely on the control rules in devcg to intermediate the access an tell us what those tasks can or cannot do. There is precedence for that with memcg: although we don't explicitly test it like I am doing it here, we are allowing tmpfs mounts to happen in user namespaces because memcg will contain them. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> --- fs/namei.c | 4 ++++ fs/namespace.c | 2 +- include/linux/mount.h | 2 ++ 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/namei.c b/fs/namei.c index 57ae9c8..8a34d79 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2356,6 +2356,10 @@ static int may_open(struct path *path, int acc_mode, int flag) case S_IFCHR: if (path->mnt->mnt_flags & MNT_NODEV) return -EACCES; + + if ((path->mnt->mnt_flags & MNT_NODEV_NS) && + !task_in_child_devcgroup(current)) + return -EACCES; /*FALLTHRU*/ case S_IFIFO: case S_IFSOCK: diff --git a/fs/namespace.c b/fs/namespace.c index 50ca17d..fe8127e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1935,7 +1935,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags, */ if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) { flags |= MS_NODEV; - mnt_flags |= MNT_NODEV; + mnt_flags |= MNT_NODEV_NS; } } diff --git a/include/linux/mount.h b/include/linux/mount.h index d7029f4..8d190e4 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -32,6 +32,8 @@ struct mnt_namespace; #define MNT_SHRINKABLE 0x100 #define MNT_WRITE_HOLD 0x200 +#define MNT_NODEV_NS 0x400 /* userns mount, and nodev not explicit */ + #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */ #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */ /* -- 1.8.1.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations 2013-03-15 9:13 ` [PATCH 2/4] fs: allow dev accesses in userns in controlled situations Glauber Costa @ 2013-03-15 14:20 ` Serge Hallyn 0 siblings, 0 replies; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 14:20 UTC (permalink / raw) To: Glauber Costa Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman, Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski Quoting Glauber Costa (glommer@parallels.com): > So far, unless the filesystem explicitly marks it (and most don't), > processes running in user namespaces won't be allowed to access any > devices. Although this makes sense, this is a quite restrictive rule, > since a lot of those accesses would be perfectly safe: aside from the > simple char devices in /dev/ like null, zero, etc, it is perfectly > possible to assign a device for usage inside a namespace if we can > establish trust in that operation. Yeah we've talked about that too - as Eric pointed out it's not strictly necessary as we can set that up with bind mounts from the host's /dev. So if he wants to nack - especially temporarily while other things fall into place - I won't argue, but I'm happy with this. > We will do that by marking the mount as MNT_NODEV_NS instead of > MNT_NODEV. This is because if the mount operation explicitly asked for > nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the > task is not on a device cgroup. If it is, we will rely on the control > rules in devcg to intermediate the access an tell us what those tasks > can or cannot do. Well the devcg was meant to be a temporary stopgap solution until we have device namespaces, and this seems to entrench them further, but it does make sense. > There is precedence for that with memcg: although we don't explicitly > test it like I am doing it here, we are allowing tmpfs mounts to happen > in user namespaces because memcg will contain them. > > Signed-off-by: Glauber Costa <glommer@parallels.com> > Cc: Aristeu Rozanski <aris@redhat.com> > Cc: Eric Biederman <ebiederm@xmission.com> > Cc: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> > --- > fs/namei.c | 4 ++++ > fs/namespace.c | 2 +- > include/linux/mount.h | 2 ++ > 3 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 57ae9c8..8a34d79 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2356,6 +2356,10 @@ static int may_open(struct path *path, int acc_mode, int flag) > case S_IFCHR: > if (path->mnt->mnt_flags & MNT_NODEV) > return -EACCES; > + > + if ((path->mnt->mnt_flags & MNT_NODEV_NS) && > + !task_in_child_devcgroup(current)) > + return -EACCES; > /*FALLTHRU*/ > case S_IFIFO: > case S_IFSOCK: > diff --git a/fs/namespace.c b/fs/namespace.c > index 50ca17d..fe8127e 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1935,7 +1935,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags, > */ > if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) { > flags |= MS_NODEV; > - mnt_flags |= MNT_NODEV; > + mnt_flags |= MNT_NODEV_NS; > } > } > > diff --git a/include/linux/mount.h b/include/linux/mount.h > index d7029f4..8d190e4 100644 > --- a/include/linux/mount.h > +++ b/include/linux/mount.h > @@ -32,6 +32,8 @@ struct mnt_namespace; > #define MNT_SHRINKABLE 0x100 > #define MNT_WRITE_HOLD 0x200 > > +#define MNT_NODEV_NS 0x400 /* userns mount, and nodev not explicit */ > + > #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */ > #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */ > /* > -- > 1.8.1.2 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 3/4] fs: allow mknod in user namespaces [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 9:13 ` [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup Glauber Costa 2013-03-15 9:13 ` [PATCH 2/4] fs: allow dev accesses in userns in controlled situations Glauber Costa @ 2013-03-15 9:13 ` Glauber Costa [not found] ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (2 more replies) 2013-03-15 9:13 ` [PATCH 4/4] devpts: fix usage " Glauber Costa 2013-03-15 10:26 ` [PATCH 0/4] fix depvpts " Eric W. Biederman 4 siblings, 3 replies; 28+ messages in thread From: Glauber Costa @ 2013-03-15 9:13 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA Cc: Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Glauber Costa, Aristeu Rozanski Since we have strict control on who access the devices, it should be no problem to allow the device to appear. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> --- fs/namei.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namei.c b/fs/namei.c index 8a34d79..d0b4549 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) if (error) return error; - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD)) + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD)) return -EPERM; if (!dir->i_op->mknod) -- 1.8.1.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
[parent not found: <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH 3/4] fs: allow mknod in user namespaces [not found] ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 14:37 ` Serge Hallyn 2013-03-15 14:49 ` Glauber Costa 0 siblings, 1 reply; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 14:37 UTC (permalink / raw) To: Glauber Costa Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > Since we have strict control on who access the devices, it should be > no problem to allow the device to appear. > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > --- > fs/namei.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 8a34d79..d0b4549 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) > if (error) > return error; > > - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD)) > + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD)) I realize you're arguing that devicens is enough, but how about doing inode_capable(dir, CAP_MKNOD) instead? > return -EPERM; > > if (!dir->i_op->mknod) > -- > 1.8.1.2 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/4] fs: allow mknod in user namespaces 2013-03-15 14:37 ` Serge Hallyn @ 2013-03-15 14:49 ` Glauber Costa [not found] ` <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 14:49 UTC (permalink / raw) To: Serge Hallyn Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman, Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski On 03/15/2013 06:37 PM, Serge Hallyn wrote: > Quoting Glauber Costa (glommer@parallels.com): >> Since we have strict control on who access the devices, it should be >> no problem to allow the device to appear. >> >> Signed-off-by: Glauber Costa <glommer@parallels.com> >> Cc: Aristeu Rozanski <aris@redhat.com> >> Cc: Eric Biederman <ebiederm@xmission.com> >> Cc: Serge Hallyn <serge.hallyn@canonical.com> >> --- >> fs/namei.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/namei.c b/fs/namei.c >> index 8a34d79..d0b4549 100644 >> --- a/fs/namei.c >> +++ b/fs/namei.c >> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) >> if (error) >> return error; >> >> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD)) >> + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD)) > > I realize you're arguing that devicens is enough, but how about > doing inode_capable(dir, CAP_MKNOD) instead? > I see no reason not to do it. ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH 3/4] fs: allow mknod in user namespaces [not found] ` <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 15:14 ` Serge Hallyn 0 siblings, 0 replies; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 15:14 UTC (permalink / raw) To: Glauber Costa Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > On 03/15/2013 06:37 PM, Serge Hallyn wrote: > > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > >> Since we have strict control on who access the devices, it should be > >> no problem to allow the device to appear. > >> > >> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > >> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > >> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > >> --- > >> fs/namei.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/fs/namei.c b/fs/namei.c > >> index 8a34d79..d0b4549 100644 > >> --- a/fs/namei.c > >> +++ b/fs/namei.c > >> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) > >> if (error) > >> return error; > >> > >> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD)) > >> + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD)) > > > > I realize you're arguing that devicens is enough, but how about > > doing inode_capable(dir, CAP_MKNOD) instead? > > > I see no reason not to do it. Cool, with that Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> thanks. -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/4] fs: allow mknod in user namespaces 2013-03-15 9:13 ` [PATCH 3/4] fs: allow mknod in user namespaces Glauber Costa [not found] ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 18:03 ` Vasily Kulikov 2013-03-15 20:43 ` Eric W. Biederman 2 siblings, 0 replies; 28+ messages in thread From: Vasily Kulikov @ 2013-03-15 18:03 UTC (permalink / raw) To: Glauber Costa Cc: cgroups, containers, Aristeu Rozanski, mtk.manpages, linux-fsdevel, Andrew Morton, Eric W. Biederman On Fri, Mar 15, 2013 at 13:13 +0400, Glauber Costa wrote: > Since we have strict control on who access the devices, it should be > no problem to allow the device to appear. ... > - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD)) > + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD)) As now we have several mechanisms for dev nodes usage in containers (cgroup, per-fs flags, CAP_MKNOD), probably it's better to document all this stuff in a single document? Enumerate all possible limits of device files creation and usage, and describe unobvious ways to abuse the API to escape from a container (allow loopback device? allow ext4? just guessing, didn't investigate myself). It should document several safe patterns for common cases and beware of known-to-be-vulnerable ones. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/4] fs: allow mknod in user namespaces 2013-03-15 9:13 ` [PATCH 3/4] fs: allow mknod in user namespaces Glauber Costa [not found] ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 18:03 ` Vasily Kulikov @ 2013-03-15 20:43 ` Eric W. Biederman 2013-03-16 0:23 ` Serge Hallyn 2 siblings, 1 reply; 28+ messages in thread From: Eric W. Biederman @ 2013-03-15 20:43 UTC (permalink / raw) To: Glauber Costa Cc: cgroups, Andrew Morton, mtk.manpages, Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski Glauber Costa <glommer@parallels.com> writes: > Since we have strict control on who access the devices, it should be > no problem to allow the device to appear. Having cgroups or user namespaces grant privileges makes me uneasy. With these patches it looks like I can do something evil like. 1. Create a devcgroup. 2. Put a process in it. 3. Create a usernamespace. 4. Run a container in that user namespace. 5. As an unprivileged user in that user namespace create another user namespace. 6. Call mknod and have it succeed. Or in short I don't think this handles nested user namespaces at all. With or without Serge's suggested change. At a practical level now is not the right time to be granting more permissions to user namespaces. Lately too many silly bugs have been found in what is already there. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/4] fs: allow mknod in user namespaces 2013-03-15 20:43 ` Eric W. Biederman @ 2013-03-16 0:23 ` Serge Hallyn 0 siblings, 0 replies; 28+ messages in thread From: Serge Hallyn @ 2013-03-16 0:23 UTC (permalink / raw) To: Eric W. Biederman Cc: Glauber Costa, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski Quoting Eric W. Biederman (ebiederm@xmission.com): > Glauber Costa <glommer@parallels.com> writes: > > > Since we have strict control on who access the devices, it should be > > no problem to allow the device to appear. > > Having cgroups or user namespaces grant privileges makes me uneasy. > > With these patches it looks like I can do something evil like. > > 1. Create a devcgroup. > 2. Put a process in it. > 3. Create a usernamespace. > 4. Run a container in that user namespace. > 5. As an unprivileged user in that user namespace create another user namespace. > 6. Call mknod and have it succeed. not if the devcgroup forbids it. > Or in short I don't think this handles nested user namespaces at all. > With or without Serge's suggested change. Yeah my change doesn't help, other than to stop the unpriv user from creating the device in an fs he doesn't own... > At a practical level now is not the right time to be granting more > permissions to user namespaces. Lately too many silly bugs have been > found in what is already there. I agree. I realize this doesn't help the centos old-udev situation, but otherwise bind mounting device files works fine, so I agree we should wait. Sorry. -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 4/4] devpts: fix usage in user namespaces [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (2 preceding siblings ...) 2013-03-15 9:13 ` [PATCH 3/4] fs: allow mknod in user namespaces Glauber Costa @ 2013-03-15 9:13 ` Glauber Costa [not found] ` <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 10:26 ` [PATCH 0/4] fix depvpts " Eric W. Biederman 4 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 9:13 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA Cc: Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Glauber Costa The current usage of devpts in user namespaces is quite problematic for a couple of reasons: * It requires the newinstance option, which is a sensible thing. However, if we are running full distributions inside containers or any other form of unmodified apps, that option wouldn't be passed for what they believe to be the initial mount. We could of course do the mount in the container initialization itself, but many distributions will somehow remount /dev in temporary storage for udev, and then mount devpts ontop of that. That gets rid of our setup, and overcoming that would require a complex synchronization between the control tool and the distribution booting. A much better behavior is to imply that option for the first mount in the user namespace. * Again, unless we are talking about a custom application, standard procedures like ssh, openpty, and friends, will try to open /dev/ptmx instead of /dev/pts/ptmx. We can link that in userspace but that is quite cumbersome, for the same reasons listed above. A better behavior is to respect newinstance mounts regardless of whether they come from, but keep track of who is the first mount inside the usernamespace. That becomes the "root" of that namespace, and inodes without ties to a specific superblock will point to that instead of devpts_mount. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> --- fs/devpts/inode.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 151 insertions(+), 6 deletions(-) diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c index 073d30b..58fbff4 100644 --- a/fs/devpts/inode.c +++ b/fs/devpts/inode.c @@ -123,12 +123,129 @@ static const match_table_t tokens = { {Opt_err, NULL} }; +struct userns_ptmx { + struct list_head list; + struct user_namespace *owner; + struct super_block *sb; +}; + struct pts_fs_info { struct ida allocated_ptys; struct pts_mount_opts mount_opts; struct dentry *ptmx_dentry; +#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS) + struct userns_ptmx *ns_ptmx; +#endif }; +#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS) +/* + * When using user namespaces, we will require devpts to be created as a new + * instance and thus will span a new superblock. However, applications running + * inside a container may still try to open /dev/ptmx, instead of + * /dev/pts/ptmx. That will be the rule for all non custom applications. When + * that happen, they will end up opening the root namespace's /dev/ptmx. That + * will in turn create a new pts in the main /dev, outside the reach of our + * user namespace. + * + * The rule to deal with it is very simple: The first /dev/pts mount inside the + * namespace will become the owner of that namespace. Anyone reading /dev/ptmx + * will be redirected to /dev/pts/ptmx of that mount. Subsequent mounts will + * have to create a new instance to have a separated ptmx, just the way it it + * for non user namespace. + */ +static LIST_HEAD(userns_ptmx_list); +static DEFINE_MUTEX(userns_ptmx_mutex); +struct pts_fs_info; + +/* + * Having ns_ptmx stored somewhere in the superblock (and fsi is the obvious + * choice) is para-mount (pun intended) destroying it while umounting. We + * need to signal that there is something to destroy. + */ +static void define_ns_root_ptmx(struct super_block *sb, + struct userns_ptmx *ns_ptmx) +{ + struct pts_fs_info *fsi = sb->s_fs_info; + if (!ns_ptmx) + return; + + fsi->ns_ptmx = ns_ptmx; + ns_ptmx->sb = sb; +} + +static void remove_ns_root_ptmx(struct pts_fs_info *fsi) +{ + if (!fsi->ns_ptmx) + return; + + if (WARN_ON(fsi->ns_ptmx->owner == &init_user_ns)) + return; + + mutex_lock(&userns_ptmx_mutex); + list_del(&fsi->ns_ptmx->list); + mutex_unlock(&userns_ptmx_mutex); + kfree(fsi->ns_ptmx); +} + +static struct super_block *pts_sb_of_namespace(struct user_namespace *userns) +{ + struct userns_ptmx *ns_ptmx = NULL; + struct super_block *sb = NULL; + + mutex_lock(&userns_ptmx_mutex); + list_for_each_entry(ns_ptmx, &userns_ptmx_list, list) { + if (ns_ptmx->owner != userns) + continue; + sb = ns_ptmx->sb; + goto out; + } +out: + mutex_unlock(&userns_ptmx_mutex); + return sb; +} + +static struct userns_ptmx *pts_new_namespace_root(struct user_namespace *userns) +{ + struct userns_ptmx *ns_ptmx; + if (userns == &init_user_ns) + return NULL; + + ns_ptmx = kzalloc(sizeof(*ns_ptmx), GFP_KERNEL); + if (!ns_ptmx) + return ns_ptmx; + + ns_ptmx->owner = userns; + mutex_lock(&userns_ptmx_mutex); + list_add(&ns_ptmx->list, &userns_ptmx_list); + mutex_unlock(&userns_ptmx_mutex); + + return ns_ptmx; +} +#else +static inline void define_ns_root_ptmx(struct super_block *sb, + struct userns_ptmx *ns_ptmx) +{ +} + +static inline struct super_block * +pts_sb_of_namespace(struct user_namespace *userns) +{ + return NULL; +} + +static inline struct userns_ptmx * +pts_new_namespace_root(struct user_namespace *userns) +{ + return NULL; +} + +static inline void remove_ns_root_ptmx(struct pts_fs_info *fsi) +{ +} +#endif + + static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb) { return sb->s_fs_info; @@ -139,6 +256,13 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode) #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC) return inode->i_sb; + else if (current_user_ns() != &init_user_ns) + /* + * If there is valid devpts superblock associated with the + * inode, then we respect it no matter what. Otherwise, which + * superblock we return depends on which namespace we are at. + */ + return pts_sb_of_namespace(current_user_ns()); #endif return devpts_mnt->mnt_sb; } @@ -442,16 +566,31 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type, int error; struct pts_mount_opts opts; struct super_block *s; + struct user_namespace *userns = current_user_ns(); + struct userns_ptmx *ns_ptmx = NULL; error = parse_mount_options(data, PARSE_MOUNT, &opts); if (error) return ERR_PTR(error); - /* Require newinstance for all user namespace mounts to ensure - * the mount options are not changed. + /* + * Require newinstance for all user namespace mounts to ensure + * the mount options are not changed. If this is the first mount + * of the namespace, that option is implied. */ - if ((current_user_ns() != &init_user_ns) && !opts.newinstance) - return ERR_PTR(-EINVAL); + if ((userns != &init_user_ns) && !opts.newinstance) { + /* + * And this is how we know we're talking about the first + * mount. If we don't have an assigned ptmx of userns, + * then we have to create it. + */ + if (pts_sb_of_namespace(userns)) + return ERR_PTR(-EINVAL); + ns_ptmx = pts_new_namespace_root(userns); + if (!ns_ptmx) + return ERR_PTR(-ENOMEM); + opts.newinstance = 1; + } if (opts.newinstance) s = sget(fs_type, NULL, set_anon_super, flags, NULL); @@ -459,8 +598,10 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type, s = sget(fs_type, compare_init_pts_sb, set_anon_super, flags, NULL); - if (IS_ERR(s)) - return ERR_CAST(s); + if (IS_ERR(s)) { + error = PTR_ERR(s); + goto out_no_s; + } if (!s->s_root) { error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0); @@ -470,6 +611,7 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type, } memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts)); + define_ns_root_ptmx(s, ns_ptmx); error = mknod_ptmx(s); if (error) @@ -479,6 +621,8 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type, out_undo_sget: deactivate_locked_super(s); +out_no_s: + kfree(ns_ptmx); return ERR_PTR(error); } @@ -498,6 +642,7 @@ static void devpts_kill_sb(struct super_block *sb) { struct pts_fs_info *fsi = DEVPTS_SB(sb); + remove_ns_root_ptmx(fsi); kfree(fsi); kill_litter_super(sb); } -- 1.8.1.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
[parent not found: <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH 4/4] devpts: fix usage in user namespaces [not found] ` <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 14:45 ` Serge Hallyn 0 siblings, 0 replies; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 14:45 UTC (permalink / raw) To: Glauber Costa Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): Interesting approach, neat. But I'll let you and Eric fight it out :) thanks, -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (3 preceding siblings ...) 2013-03-15 9:13 ` [PATCH 4/4] devpts: fix usage " Glauber Costa @ 2013-03-15 10:26 ` Eric W. Biederman [not found] ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-03-15 14:00 ` Serge Hallyn 4 siblings, 2 replies; 28+ messages in thread From: Eric W. Biederman @ 2013-03-15 10:26 UTC (permalink / raw) To: Glauber Costa Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > Hi, > > devpts mounts in user namespaces is queued for 3.9. However, while playing > with it I found it to be less than ideal. Although it could possibly work > with custom software that can be made to point to /dev/pts/ptmx, a few things > prevent it from working correctly for people that, like us, are booting full > distributions. Full distributions that have not been modified to be minimally container aware. > In those scenarios, things like udev will kick in, maybe remount /dev undoing > any setup we might have done, and then software like sshd or anything else > calling openpty will search for /dev/ptmx, not /dev/pts/ptmx. I believe udev stopped running in containers a year or so ago. > One of the problems that I am addressing in here is that we are disallowing > mknod in usernamespaces. Although I understand the motivation for that, I > believe that to be too restrictive, specially because we already control access > to the files separately. There should be no harm in mknod'ing something per se, > if manipulating it is forbidden. mknod in userspace needs to be a separate patchset. There is no need to solve mknod in userspace to solve devpts. > Last, /dev/ptmx will still always be the global ptmx device. We need to somehow > link it to our namespaces'. My proposal is to multiplex it and return the > correct "root ptmx" depending on which userns is reading that device. Doable. I still strongly prefer my version of having /dev/ptmx act like a link to /dev/pts/ptmx. Letting the mount namespace control it. In testing that works, and it allows a lot of devpts complexity to just go away. For older versions of udev you can even configure them with a rule to make /dev/ptmx a symlink to /dev/pts/ptmx. Newer versions of udev completely gave up on creating devices and can longer be configured to do anything useful in this regard. So we might even be able to just get away with a bit of udev and devtmpfs configuration. And treat devpts as if newinstance is always specified. Certainly that has worked in my testing so far. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 0/4] fix depvpts in user namespaces [not found] ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-03-15 12:01 ` Glauber Costa 0 siblings, 0 replies; 28+ messages in thread From: Glauber Costa @ 2013-03-15 12:01 UTC (permalink / raw) To: Eric W. Biederman Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton On 03/15/2013 02:26 PM, Eric W. Biederman wrote: > Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > >> Hi, >> >> devpts mounts in user namespaces is queued for 3.9. However, while playing >> with it I found it to be less than ideal. Although it could possibly work >> with custom software that can be made to point to /dev/pts/ptmx, a few things >> prevent it from working correctly for people that, like us, are booting full >> distributions. > > Full distributions that have not been modified to be minimally container > aware. > Yes, which is true for every single distribution that was released before containers became so pervasive. I believe we should be able to make *better* decisions when we know we are in a container, but that should still work. >> In those scenarios, things like udev will kick in, maybe remount /dev undoing >> any setup we might have done, and then software like sshd or anything else >> calling openpty will search for /dev/ptmx, not /dev/pts/ptmx. > > I believe udev stopped running in containers a year or so ago. A year is not that big of a timeframe. I am running centos6 for instance, and it runs udev. That is not even that ancient for enterprise standards. > >> One of the problems that I am addressing in here is that we are disallowing >> mknod in usernamespaces. Although I understand the motivation for that, I >> believe that to be too restrictive, specially because we already control access >> to the files separately. There should be no harm in mknod'ing something per se, >> if manipulating it is forbidden. > > mknod in userspace needs to be a separate patchset. There is no need to > solve mknod in userspace to solve devpts. > Well, yes. Patches 1 - 3 are technically independent of patch 4. If you would review them and let me know what you think I would be much appreciated. Reiterating, the proposal is akin to memcg+tmpfs, but I am relaying control of devices to device cgroups, + requiring them to be present. > >> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow >> link it to our namespaces'. My proposal is to multiplex it and return the >> correct "root ptmx" depending on which userns is reading that device. > > Doable. I still strongly prefer my version of having /dev/ptmx act like > a link to /dev/pts/ptmx. Letting the mount namespace control it. > You mean an explicit link, or something else? > In testing that works, and it allows a lot of devpts complexity to just > go away. For older versions of udev you can even configure them with a > rule to make /dev/ptmx a symlink to /dev/pts/ptmx. At this point you are not getting rid of complexity, you are just moving it to a different location. Instead of handling it in the kernel, we know need to go and provide fixed configuration files for every single distribution one may want to run in a container. > > So we might even be able to just get away with a bit of udev and > devtmpfs configuration. And treat devpts as if newinstance is always > specified. Certainly that has worked in my testing so far. > I can confirm that linking /dev/pts/ptmx to /dev/ptmx works. And also that it needs configuration, and that this configuration will be different for different distributions, possibly including distributions releases. Handling it in the kernel is not *that* complicated and it passed my tests with no hassles. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-15 10:26 ` [PATCH 0/4] fix depvpts " Eric W. Biederman [not found] ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-03-15 14:00 ` Serge Hallyn 2013-03-15 14:42 ` Glauber Costa 1 sibling, 1 reply; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 14:00 UTC (permalink / raw) To: Eric W. Biederman Cc: Glauber Costa, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn, linux-fsdevel, containers Quoting Eric W. Biederman (ebiederm@xmission.com): > Glauber Costa <glommer@parallels.com> writes: > > > Hi, > > > > devpts mounts in user namespaces is queued for 3.9. However, while playing > > with it I found it to be less than ideal. Although it could possibly work > > with custom software that can be made to point to /dev/pts/ptmx, a few things > > prevent it from working correctly for people that, like us, are booting full > > distributions. > > Full distributions that have not been modified to be minimally container > aware. Right, in fact in this case it doesn't need to be minimally container aware, you just create the bind mount yourself and init just needs to accept that it shouldn't touch it. > > In those scenarios, things like udev will kick in, maybe remount /dev undoing > > any setup we might have done, and then software like sshd or anything else > > calling openpty will search for /dev/ptmx, not /dev/pts/ptmx. > > I believe udev stopped running in containers a year or so ago. No, udev runs fine in containers, we just don't allow udevadm trigger. > > One of the problems that I am addressing in here is that we are disallowing > > mknod in usernamespaces. Although I understand the motivation for that, I > > believe that to be too restrictive, specially because we already control access > > to the files separately. There should be no harm in mknod'ing something per se, > > if manipulating it is forbidden. > > mknod in userspace needs to be a separate patchset. There is no need to > solve mknod in userspace to solve devpts. > > > > Last, /dev/ptmx will still always be the global ptmx device. We need to somehow > > link it to our namespaces'. My proposal is to multiplex it and return the > > correct "root ptmx" depending on which userns is reading that device. > > Doable. I still strongly prefer my version of having /dev/ptmx act like > a link to /dev/pts/ptmx. Letting the mount namespace control it. Right, Glauber have you seen this patch? Eric did already solve this. (And again that's a nice safeguard, but it shouldn't be necessary) > In testing that works, and it allows a lot of devpts complexity to just > go away. For older versions of udev you can even configure them with a > rule to make /dev/ptmx a symlink to /dev/pts/ptmx. Newer versions of > udev completely gave up on creating devices and can longer be configured > to do anything useful in this regard. > > So we might even be able to just get away with a bit of udev and > devtmpfs configuration. devtmpfs? Until we get multiple separate mounts of devtmpfs, don't use it in a container :) > And treat devpts as if newinstance is always > specified. Certainly that has worked in my testing so far. > > Eric -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-15 14:00 ` Serge Hallyn @ 2013-03-15 14:42 ` Glauber Costa [not found] ` <5143333E.1040100-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 28+ messages in thread From: Glauber Costa @ 2013-03-15 14:42 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On 03/15/2013 06:00 PM, Serge Hallyn wrote: > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): >> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: >> >>> Hi, >>> >>> devpts mounts in user namespaces is queued for 3.9. However, while playing >>> with it I found it to be less than ideal. Although it could possibly work >>> with custom software that can be made to point to /dev/pts/ptmx, a few things >>> prevent it from working correctly for people that, like us, are booting full >>> distributions. >> >> Full distributions that have not been modified to be minimally container >> aware. > > Right, in fact in this case it doesn't need to be minimally container > aware, you just create the bind mount yourself and init just needs to > accept that it shouldn't touch it. > Well, what if it doesn't? At least in the system I am using, centos6, udev mounts a tmpfs in a temporary location, and then mount --move this to /dev. This is now empty, and devpts will be mounted ontop of that. Although it can be changed, of course, it is very likely to be due to its age. And that is not even the oldest distribution around. Now, both operations are totally valid inside namespaces. Both mount --move and mounting tmpfs. If there were any way to identify those specific mounts and block them, I would be fine. But so far, given my understanding, you guys are asking me to either go convince people to change their very old stable distributions, or complicate deployment with all sorts of special cases for them. I fully agree that the behavior you describe is the best behavior if it can be done, but I am not satisfied with the answer that legacy distributions should somehow be adapted. Let me reverse the question: If you bind mount /dev/pts and then udev never touches it, etc, does my solution affects that in anyway? The way I see it, we just become more capable of running legacy system without giving nothing in return aside from code. And it is not even an extremely complex code. >>> One of the problems that I am addressing in here is that we are disallowing >>> mknod in usernamespaces. Although I understand the motivation for that, I >>> believe that to be too restrictive, specially because we already control access >>> to the files separately. There should be no harm in mknod'ing something per se, >>> if manipulating it is forbidden. >> >> mknod in userspace needs to be a separate patchset. There is no need to >> solve mknod in userspace to solve devpts. >> >> >>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow >>> link it to our namespaces'. My proposal is to multiplex it and return the >>> correct "root ptmx" depending on which userns is reading that device. >> >> Doable. I still strongly prefer my version of having /dev/ptmx act like >> a link to /dev/pts/ptmx. Letting the mount namespace control it. > > Right, Glauber have you seen this patch? Eric did already solve this. > (And again that's a nice safeguard, but it shouldn't be necessary) > No. Where was that sent to? If you can point me to it, I am of course willing to test it. If it solves my problem (the description suggests that there is high probability), then I have no particular attachments to my specific solution. ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <5143333E.1040100-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH 0/4] fix depvpts in user namespaces [not found] ` <5143333E.1040100-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 15:21 ` Serge Hallyn 2013-03-15 15:26 ` Glauber Costa 0 siblings, 1 reply; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 15:21 UTC (permalink / raw) To: Glauber Costa Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > On 03/15/2013 06:00 PM, Serge Hallyn wrote: > > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > >> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > >> > >>> Hi, > >>> > >>> devpts mounts in user namespaces is queued for 3.9. However, while playing > >>> with it I found it to be less than ideal. Although it could possibly work > >>> with custom software that can be made to point to /dev/pts/ptmx, a few things > >>> prevent it from working correctly for people that, like us, are booting full > >>> distributions. > >> > >> Full distributions that have not been modified to be minimally container > >> aware. > > > > Right, in fact in this case it doesn't need to be minimally container > > aware, you just create the bind mount yourself and init just needs to > > accept that it shouldn't touch it. > > > > Well, what if it doesn't? > > At least in the system I am using, centos6, udev mounts a tmpfs in a > temporary location, and then mount --move this to /dev. This is now > empty, and devpts will be mounted ontop of that. This also messes up your /dev/ttyN setup right? How are you handling that? Usually in cases like this I've found that either (1) pre-mounting the fs caused the container-unaware service to not mount it (which in this case isn't happening) or (2) the service might tolerate my preventing the offensive mount (using apparmor by pathname, you could also do it with a special selinux label on /dev). But like you say in this specific case I see no reason not to fix it in the kernel. > Although it can be changed, of course, it is very likely to be due to > its age. And that is not even the oldest distribution around. > > Now, both operations are totally valid inside namespaces. Both mount > --move and mounting tmpfs. If there were any way to identify those > specific mounts and block them, I would be fine. > > But so far, given my understanding, you guys are asking me to either go > convince people to change their very old stable distributions, or > complicate deployment with all sorts of special cases for them. > > I fully agree that the behavior you describe is the best behavior if it > can be done, but I am not satisfied with the answer that legacy > distributions should somehow be adapted. > > Let me reverse the question: If you bind mount /dev/pts and then udev > never touches it, etc, does my solution affects that in anyway? The way > I see it, we just become more capable of running legacy system without > giving nothing in return aside from code. And it is not even an > extremely complex code. Right - I don't object to your patch, I just wanted to see if you and Eric could agree on which one to use. After that I'll do a closer review - but on first look it looked good to me. > >>> One of the problems that I am addressing in here is that we are disallowing > >>> mknod in usernamespaces. Although I understand the motivation for that, I > >>> believe that to be too restrictive, specially because we already control access > >>> to the files separately. There should be no harm in mknod'ing something per se, > >>> if manipulating it is forbidden. > >> > >> mknod in userspace needs to be a separate patchset. There is no need to > >> solve mknod in userspace to solve devpts. > >> > >> > >>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow > >>> link it to our namespaces'. My proposal is to multiplex it and return the > >>> correct "root ptmx" depending on which userns is reading that device. > >> > >> Doable. I still strongly prefer my version of having /dev/ptmx act like > >> a link to /dev/pts/ptmx. Letting the mount namespace control it. > > > > Right, Glauber have you seen this patch? Eric did already solve this. > > (And again that's a nice safeguard, but it shouldn't be necessary) > > > No. Where was that sent to? > > If you can point me to it, I am of course willing to test it. If it > solves my problem (the description suggests that there is high > probability), then I have no particular attachments to my specific solution. Well shoot, I can't find it right now. Not even in Eric's git tree. IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx and open that instead. -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-15 15:21 ` Serge Hallyn @ 2013-03-15 15:26 ` Glauber Costa [not found] ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 21:02 ` Eric W. Biederman 0 siblings, 2 replies; 28+ messages in thread From: Glauber Costa @ 2013-03-15 15:26 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn, linux-fsdevel, containers On 03/15/2013 07:21 PM, Serge Hallyn wrote: > Quoting Glauber Costa (glommer@parallels.com): >> On 03/15/2013 06:00 PM, Serge Hallyn wrote: >>> Quoting Eric W. Biederman (ebiederm@xmission.com): >>>> Glauber Costa <glommer@parallels.com> writes: >>>> >>>>> Hi, >>>>> >>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing >>>>> with it I found it to be less than ideal. Although it could possibly work >>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things >>>>> prevent it from working correctly for people that, like us, are booting full >>>>> distributions. >>>> >>>> Full distributions that have not been modified to be minimally container >>>> aware. >>> >>> Right, in fact in this case it doesn't need to be minimally container >>> aware, you just create the bind mount yourself and init just needs to >>> accept that it shouldn't touch it. >>> >> >> Well, what if it doesn't? >> >> At least in the system I am using, centos6, udev mounts a tmpfs in a >> temporary location, and then mount --move this to /dev. This is now >> empty, and devpts will be mounted ontop of that. > > This also messes up your /dev/ttyN setup right? How are you handling > that? > very simple: udev will just mknod everything back, so let him! >> Let me reverse the question: If you bind mount /dev/pts and then udev >> never touches it, etc, does my solution affects that in anyway? The way >> I see it, we just become more capable of running legacy system without >> giving nothing in return aside from code. And it is not even an >> extremely complex code. > > Right - I don't object to your patch, I just wanted to see if you and > Eric could agree on which one to use. After that I'll do a closer > review - but on first look it looked good to me. > As I said, as far as the specific part of the puzzle concerning the /dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works. >>>>> One of the problems that I am addressing in here is that we are disallowing >>>>> mknod in usernamespaces. Although I understand the motivation for that, I >>>>> believe that to be too restrictive, specially because we already control access >>>>> to the files separately. There should be no harm in mknod'ing something per se, >>>>> if manipulating it is forbidden. >>>> >>>> mknod in userspace needs to be a separate patchset. There is no need to >>>> solve mknod in userspace to solve devpts. >>>> >>>> >>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow >>>>> link it to our namespaces'. My proposal is to multiplex it and return the >>>>> correct "root ptmx" depending on which userns is reading that device. >>>> >>>> Doable. I still strongly prefer my version of having /dev/ptmx act like >>>> a link to /dev/pts/ptmx. Letting the mount namespace control it. >>> >>> Right, Glauber have you seen this patch? Eric did already solve this. >>> (And again that's a nice safeguard, but it shouldn't be necessary) >>> >> No. Where was that sent to? >> >> If you can point me to it, I am of course willing to test it. If it >> solves my problem (the description suggests that there is high >> probability), then I have no particular attachments to my specific solution. > > Well shoot, I can't find it right now. Not even in Eric's git tree. > IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx > and open that instead. > Which gives a very good explanation about why haven't I seen it =) Eric ? What it a /dev/ptmx already exist? will it use it? That would be bad, since that /dev/ptmx could be a host-side one. I actually believe linking to $rootfs/dev/pts/ptmx is more robust than my solution against remounts. So provided it can guarantee that the ptmx is not ever the root ptmx, I would ack that. ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH 0/4] fix depvpts in user namespaces [not found] ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 15:58 ` Serge Hallyn 2013-03-15 16:01 ` Glauber Costa 0 siblings, 1 reply; 28+ messages in thread From: Serge Hallyn @ 2013-03-15 15:58 UTC (permalink / raw) To: Glauber Costa Cc: Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > On 03/15/2013 07:21 PM, Serge Hallyn wrote: > > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): > >> On 03/15/2013 06:00 PM, Serge Hallyn wrote: > >>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > >>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > >>>> > >>>>> Hi, > >>>>> > >>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing > >>>>> with it I found it to be less than ideal. Although it could possibly work > >>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things > >>>>> prevent it from working correctly for people that, like us, are booting full > >>>>> distributions. > >>>> > >>>> Full distributions that have not been modified to be minimally container > >>>> aware. > >>> > >>> Right, in fact in this case it doesn't need to be minimally container > >>> aware, you just create the bind mount yourself and init just needs to > >>> accept that it shouldn't touch it. > >>> > >> > >> Well, what if it doesn't? > >> > >> At least in the system I am using, centos6, udev mounts a tmpfs in a > >> temporary location, and then mount --move this to /dev. This is now > >> empty, and devpts will be mounted ontop of that. > > > > This also messes up your /dev/ttyN setup right? How are you handling > > that? > > > very simple: udev will just mknod everything back, so let him! So you're not using bind-mounted ptys over /dev/ttyN? > >> Let me reverse the question: If you bind mount /dev/pts and then udev > >> never touches it, etc, does my solution affects that in anyway? The way > >> I see it, we just become more capable of running legacy system without > >> giving nothing in return aside from code. And it is not even an > >> extremely complex code. > > > > Right - I don't object to your patch, I just wanted to see if you and > > Eric could agree on which one to use. After that I'll do a closer > > review - but on first look it looked good to me. > > > > As I said, as far as the specific part of the puzzle concerning the > /dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works. ditto. > >>>>> One of the problems that I am addressing in here is that we are disallowing > >>>>> mknod in usernamespaces. Although I understand the motivation for that, I > >>>>> believe that to be too restrictive, specially because we already control access > >>>>> to the files separately. There should be no harm in mknod'ing something per se, > >>>>> if manipulating it is forbidden. > >>>> > >>>> mknod in userspace needs to be a separate patchset. There is no need to > >>>> solve mknod in userspace to solve devpts. > >>>> > >>>> > >>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow > >>>>> link it to our namespaces'. My proposal is to multiplex it and return the > >>>>> correct "root ptmx" depending on which userns is reading that device. > >>>> > >>>> Doable. I still strongly prefer my version of having /dev/ptmx act like > >>>> a link to /dev/pts/ptmx. Letting the mount namespace control it. > >>> > >>> Right, Glauber have you seen this patch? Eric did already solve this. > >>> (And again that's a nice safeguard, but it shouldn't be necessary) > >>> > >> No. Where was that sent to? > >> > >> If you can point me to it, I am of course willing to test it. If it > >> solves my problem (the description suggests that there is high > >> probability), then I have no particular attachments to my specific solution. > > > > Well shoot, I can't find it right now. Not even in Eric's git tree. > > IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx > > and open that instead. > > > Which gives a very good explanation about why haven't I seen it =) > Eric ? > > What it a /dev/ptmx already exist? will it use it? That would be bad, No, I think it was the open handler on the ptmx node that did it. If the fs underlying the ptmx file wasn't devpts then it would do the path lookup. So then if /dev/pts/ptmx was actually the host one, then you get what you asked for, but it's not like a symlink (where the link could be to host /dev/pts/ptmx whereas current task has a different /dev/pts/ptmx) > since that /dev/ptmx could be a host-side one. I actually believe > linking to $rootfs/dev/pts/ptmx is more robust than my solution against > remounts. So provided it can guarantee that the ptmx is not ever the > root ptmx, I would ack that. > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-15 15:58 ` Serge Hallyn @ 2013-03-15 16:01 ` Glauber Costa 0 siblings, 0 replies; 28+ messages in thread From: Glauber Costa @ 2013-03-15 16:01 UTC (permalink / raw) To: Serge Hallyn Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w On 03/15/2013 07:58 PM, Serge Hallyn wrote: > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): >> On 03/15/2013 07:21 PM, Serge Hallyn wrote: >>> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org): >>>> On 03/15/2013 06:00 PM, Serge Hallyn wrote: >>>>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): >>>>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing >>>>>>> with it I found it to be less than ideal. Although it could possibly work >>>>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things >>>>>>> prevent it from working correctly for people that, like us, are booting full >>>>>>> distributions. >>>>>> >>>>>> Full distributions that have not been modified to be minimally container >>>>>> aware. >>>>> >>>>> Right, in fact in this case it doesn't need to be minimally container >>>>> aware, you just create the bind mount yourself and init just needs to >>>>> accept that it shouldn't touch it. >>>>> >>>> >>>> Well, what if it doesn't? >>>> >>>> At least in the system I am using, centos6, udev mounts a tmpfs in a >>>> temporary location, and then mount --move this to /dev. This is now >>>> empty, and devpts will be mounted ontop of that. >>> >>> This also messes up your /dev/ttyN setup right? How are you handling >>> that? >>> >> very simple: udev will just mknod everything back, so let him! > > So you're not using bind-mounted ptys over /dev/ttyN? > Not in particular, and I haven't felt the need yet. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-15 15:26 ` Glauber Costa [not found] ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2013-03-15 21:02 ` Eric W. Biederman 2013-03-18 3:20 ` Serge Hallyn 1 sibling, 1 reply; 28+ messages in thread From: Eric W. Biederman @ 2013-03-15 21:02 UTC (permalink / raw) To: Glauber Costa Cc: Serge Hallyn, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn, linux-fsdevel, containers Glauber Costa <glommer@parallels.com> writes: > On 03/15/2013 07:21 PM, Serge Hallyn wrote: >> Quoting Glauber Costa (glommer@parallels.com): >>> On 03/15/2013 06:00 PM, Serge Hallyn wrote: >>>> Quoting Eric W. Biederman (ebiederm@xmission.com): >>>>> Glauber Costa <glommer@parallels.com> writes: >>>>> >> Well shoot, I can't find it right now. Not even in Eric's git tree. >> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx >> and open that instead. >> > Which gives a very good explanation about why haven't I seen it =) > Eric ? It is definitely in the development branches of my git tree. I have half a dozen patches that touch devpts. I believe the latest dev branch I have published is userns-always-map-user-v110. And the code is in there. They have not reached the top of my queue in importance at this time, and last time I was testing them there was a subtle race in something. I expect it was just a change in pty layer that I haven't followed closely enough. > What it a /dev/ptmx already exist? will it use it? That would be bad, > since that /dev/ptmx could be a host-side one. I actually believe > linking to $rootfs/dev/pts/ptmx is more robust than my solution against > remounts. So provided it can guarantee that the ptmx is not ever the > root ptmx, I would ack that. For those playing with udev, especially older udev where udev is still udev and creates devices you can use the following udev rule to create the pts/ptmx symlink. KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx" Before we do anything clever in the kernel it is definitely worth seeing how far we can take that little udev rule. I have heard of no container that runs a distro's initrd, or uses the distro's code to mount root. So containers even for old distro's can and do tweak the distro's a little. It is worth keeping the tweaks small but udev and their kin are an important bit of that. As much as I hate the notion I suspect for most of device management what we want is to act like devtmpfs, and run all of the device node creation etc outside of the container (possibly even with bind mounts). Acting like devtmpfs should be something that is possible with no kernel changes. Whereas allowing unprivileged processes to create device nodes probably has issues I haven't thought of yet. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-15 21:02 ` Eric W. Biederman @ 2013-03-18 3:20 ` Serge Hallyn 2013-03-18 21:23 ` Eric W. Biederman 0 siblings, 1 reply; 28+ messages in thread From: Serge Hallyn @ 2013-03-18 3:20 UTC (permalink / raw) To: Eric W. Biederman Cc: Glauber Costa, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn, linux-fsdevel, containers Quoting Eric W. Biederman (ebiederm@xmission.com): > Glauber Costa <glommer@parallels.com> writes: ... > > What it a /dev/ptmx already exist? will it use it? That would be bad, > > since that /dev/ptmx could be a host-side one. I actually believe > > linking to $rootfs/dev/pts/ptmx is more robust than my solution against > > remounts. So provided it can guarantee that the ptmx is not ever the > > root ptmx, I would ack that. > > For those playing with udev, especially older udev where udev is still > udev and creates devices you can use the following udev rule to create > the pts/ptmx symlink. > > KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx" > > Before we do anything clever in the kernel it is definitely worth seeing > how far we can take that little udev rule. Before it was decided that it was ok to modify core packages to accomodate containers, we had to install (non-standard) init jobs to detect it was in a container and if so modify some behavior - for instance to bind-mount a smaller /lib/init/fstab so that mountall wouldn't try to mount some things. That way the rootfs had to be updated to run in a container, but could then still be used as a rootfs for non-containers. ... > As much as I hate the notion I suspect for most of device management > what we want is to act like devtmpfs, and run all of the device node > creation etc outside of the container (possibly even with bind mounts). So you mean a task which is unprivileged on the host, privileged wrt the container, and on host fs namespace, which bind mounts the host /dev files into the container? > Acting like devtmpfs should be something that is possible with no kernel > changes. Whereas allowing unprivileged processes to create device > nodes probably has issues I haven't thought of yet. Not sure what 'acting like devtmpfs' means (especially in contrast to acting like udev) - maybe i need to go look at the code. -serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] fix depvpts in user namespaces 2013-03-18 3:20 ` Serge Hallyn @ 2013-03-18 21:23 ` Eric W. Biederman 0 siblings, 0 replies; 28+ messages in thread From: Eric W. Biederman @ 2013-03-18 21:23 UTC (permalink / raw) To: Serge Hallyn Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): >> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > ... >> > What it a /dev/ptmx already exist? will it use it? That would be bad, >> > since that /dev/ptmx could be a host-side one. I actually believe >> > linking to $rootfs/dev/pts/ptmx is more robust than my solution against >> > remounts. So provided it can guarantee that the ptmx is not ever the >> > root ptmx, I would ack that. >> >> For those playing with udev, especially older udev where udev is still >> udev and creates devices you can use the following udev rule to create >> the pts/ptmx symlink. >> >> KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx" >> >> Before we do anything clever in the kernel it is definitely worth seeing >> how far we can take that little udev rule. > > Before it was decided that it was ok to modify core packages to > accomodate containers, we had to install (non-standard) init jobs to > detect it was in a container and if so modify some behavior - for > instance to bind-mount a smaller /lib/init/fstab so that mountall > wouldn't try to mount some things. That way the rootfs had to be > updated to run in a container, but could then still be used as a > rootfs for non-containers. That little udev rule could be one of those trivial modifications. > ... > >> As much as I hate the notion I suspect for most of device management >> what we want is to act like devtmpfs, and run all of the device node >> creation etc outside of the container (possibly even with bind mounts). > > So you mean a task which is unprivileged on the host, privileged wrt the > container, and on host fs namespace, which bind mounts the host /dev > files into the container? That would be the implementation. Although you could potentialy have something privileged on the host doing the work from outside of the container. >> Acting like devtmpfs should be something that is possible with no kernel >> changes. Whereas allowing unprivileged processes to create device >> nodes probably has issues I haven't thought of yet. > > Not sure what 'acting like devtmpfs' means (especially in contrast to > acting like udev) - maybe i need to go look at the code. devtmpfs is a magic kernel thread and instance of tmpfs that calls mknod based on the default device name and permission policy that is new in the kernel. Which means that any distro that can use devmpts has no problems with something else calling mknod and creating the devices. And the initrd is expected to have mounted devtmpfs last I looked. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2013-03-18 21:23 UTC | newest] Thread overview: 28+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-03-15 9:13 [PATCH 0/4] fix depvpts in user namespaces Glauber Costa [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 9:13 ` [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup Glauber Costa [not found] ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 14:07 ` Serge Hallyn 2013-03-15 14:43 ` Glauber Costa 2013-03-15 14:55 ` Serge Hallyn 2013-03-15 19:27 ` Aristeu Rozanski 2013-03-15 9:13 ` [PATCH 2/4] fs: allow dev accesses in userns in controlled situations Glauber Costa 2013-03-15 14:20 ` Serge Hallyn 2013-03-15 9:13 ` [PATCH 3/4] fs: allow mknod in user namespaces Glauber Costa [not found] ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 14:37 ` Serge Hallyn 2013-03-15 14:49 ` Glauber Costa [not found] ` <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 15:14 ` Serge Hallyn 2013-03-15 18:03 ` Vasily Kulikov 2013-03-15 20:43 ` Eric W. Biederman 2013-03-16 0:23 ` Serge Hallyn 2013-03-15 9:13 ` [PATCH 4/4] devpts: fix usage " Glauber Costa [not found] ` <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 14:45 ` Serge Hallyn 2013-03-15 10:26 ` [PATCH 0/4] fix depvpts " Eric W. Biederman [not found] ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-03-15 12:01 ` Glauber Costa 2013-03-15 14:00 ` Serge Hallyn 2013-03-15 14:42 ` Glauber Costa [not found] ` <5143333E.1040100-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 15:21 ` Serge Hallyn 2013-03-15 15:26 ` Glauber Costa [not found] ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2013-03-15 15:58 ` Serge Hallyn 2013-03-15 16:01 ` Glauber Costa 2013-03-15 21:02 ` Eric W. Biederman 2013-03-18 3:20 ` Serge Hallyn 2013-03-18 21:23 ` Eric W. Biederman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).