* user namespace and fully visible proc and sys mounts
From: Serge E. Hallyn @ 2016-03-06 8:28 UTC
To: Eric W. Biederman, lkml, Seth Forshee, Stéphane Graber, serge, Andy Lutomirski

Hi,

So we've been over this many times... but unfortunately there is more
breakage to report. Regular privileged and unprivileged containers
work all right for us. But running an unprivileged container inside a
privileged container is blocked.

When creating privileged containers, lxc by default does a few things:
it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
/proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
(because this container is not in a user namespace) then moves
/proc/sys/net back. Finally it mounts sys ro but bind-mounts
/sys/devices/virtual/net as writeable.

If any of these are left enabled, unprivileged containers can't be
started. If all are disabled, then they can be.

Can we find a way to make these not block remounts in child user
namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?

-serge

* Re: user namespace and fully visible proc and sys mounts
From: Eric W. Biederman @ 2016-03-06 21:53 UTC
To: Serge E. Hallyn
Cc: lkml, Seth Forshee, Stéphane Graber, serge, Andy Lutomirski

"Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:

> Hi,
>
> So we've been over this many times... but unfortunately there is more
> breakage to report. Regular privileged and unprivileged containers
> work all right for us. But running an unprivileged container inside a
> privileged container is blocked.
>
> When creating privileged containers, lxc by default does a few things:
> it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> (because this container is not in a user namespace) then moves
> /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> /sys/devices/virtual/net as writeable.
>
> If any of these are left enabled, unprivileged containers can't be
> started. If all are disabled, then they can be.
>
> Can we find a way to make these not block remounts in child user
> namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?

Are any of these overmounts done for the purpose of security? It
appears the /proc/sys and /sys mounts being made read-only is for that
purpose.

If none of the mounts are for security the easy solution that works
today is to also mount /proc and /sys somewhere else in your container
so that the permission check for mounting a new copy passes.

That said /proc/sys appears to be a show stopper in this scheme. As the
root of your privileged container can enter your unprivileged container
it can bypass your read-only /proc/sys by mounting a new copy of proc if
we allow the relaxation you are requesting.

Therefore the only choice on the table (and I don't have a clue how
realistic it is) is to have a variant of proc with just files describing
processes. Call it processfs. That would not need the current
restrictions.

As for sysfs I am drawing a blank about what might be possible.

Eric
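
For readers following the thread, the "fully visible" rule behind the permission check Eric mentions can be approximated from user space by reading /proc/self/mountinfo. The Python sketch below is only an illustration under stated assumptions, not the kernel's algorithm: it treats any child mount as an obstruction (the kernel, as far as I understand it, only cares about locked child mounts and tolerates mounts over empty directories), and it compares only the read-only flag. The function names and the sample mount table are made up for this example.

```python
# Hypothetical sample of /proc/self/mountinfo inside a privileged container:
# proc is mounted rw, /proc/sys is bind-mounted read-only over itself, and a
# fuse.lxcfs file covers /proc/meminfo.
SAMPLE = """\
20 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
21 20 0:4 / /proc rw,nosuid,nodev,noexec - proc proc rw
22 21 0:4 /sys /proc/sys ro,nosuid,nodev,noexec - proc proc rw
23 21 0:40 /proc/meminfo /proc/meminfo rw,nosuid,nodev - fuse.lxcfs lxcfs rw
"""


def parse_mountinfo(text):
    """Parse mountinfo text into a list of dicts (one per mount)."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before " - " describe the mount, fields after it the filesystem.
        left, right = line.split(" - ", 1)
        fields = left.split()
        mounts.append({
            "id": int(fields[0]),
            "parent": int(fields[1]),
            "dev": fields[2],
            "root": fields[3],
            "mountpoint": fields[4],
            "options": set(fields[5].split(",")),
            "fstype": right.split()[0],
        })
    return mounts


def has_fully_visible(mounts, fstype, want_readonly=False):
    """Simplified check: is there an unobstructed mount of `fstype`?"""
    for m in mounts:
        if m["fstype"] != fstype or m["root"] != "/":
            continue  # wrong filesystem, or only a bind mount of a subtree
        if "ro" in m["options"] and not want_readonly:
            continue  # existing copy is read-only, the new one would not be
        if any(c["parent"] == m["id"] for c in mounts):
            continue  # something is mounted on top of part of it
        return True
    return False


if __name__ == "__main__":
    print(has_fully_visible(parse_mountinfo(SAMPLE), "proc"))
```

With the sample table the check fails, because both the read-only /proc/sys bind mount and the lxcfs file sit on top of the only full proc mount. Adding a second, untouched proc mount elsewhere (the workaround described above) makes it pass.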

* Re: user namespace and fully visible proc and sys mounts
From: Serge E. Hallyn @ 2016-03-06 23:38 UTC
To: Eric W. Biederman
Cc: Serge E. Hallyn, lkml, Seth Forshee, Stéphane Graber, serge, Andy Lutomirski

On Sun, Mar 06, 2016 at 03:53:40PM -0600, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
>
> > Hi,
> >
> > So we've been over this many times... but unfortunately there is more
> > breakage to report. Regular privileged and unprivileged containers
> > work all right for us. But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace) then moves
> > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started. If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security? It

The fuse.lxcfs ones are not for security. The others are for security,
but only in non-user-namespaced containers. (We're doing them in
unprivileged as well for simplicity but could stop that). We're not
overmounting to hide things, we're mounting readonly because the
procfiles are owned by the same uid that is root in the container.
Now in Ubuntu we do also have precise apparmor profiles which redundantly
prevent writing, and our only real goal is to prevent accidental host
damage, but the defense in depth is still nice to have, and I don't want
to drop that.

> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.

Right, but we're not hiding anything. In fact maybe that's how we can
detect this - if the dentry over- and under-mount for a directory is the
same, ignore it, because it doesn't fall under your original threat
scenario?

> If none of the mounts are for security the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.

Yeah, we used to do that, and I actually forgot that we used to do that.
I'll have to look into why it no longer suffices. (The security aspect
wasn't too bad, since we used apparmor to prevent any writes to the
redundant mounts)

> That said /proc/sys appears to be a show stopper in this scheme. As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.

Yeah, will have to think about that.

> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes. Call it processfs. That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.
>
> Eric
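
The "same dentry over and under" idea floated above can be illustrated from user space, again using mountinfo text. In this rough Python sketch a child mount counts as covering itself when it is on the same device as its parent mount and its root is exactly the path it sits on inside the parent. This only approximates dentry identity, the helper names are hypothetical, and nothing here reflects an existing kernel interface.

```python
import posixpath

# Hypothetical mount table: /proc/sys is bind-mounted over itself (read-only),
# while /proc/meminfo is covered by a file from a different filesystem.
SAMPLE = """\
20 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
21 20 0:4 / /proc rw,nosuid,nodev,noexec - proc proc rw
22 21 0:4 /sys /proc/sys ro,nosuid,nodev,noexec - proc proc rw
23 21 0:40 /proc/meminfo /proc/meminfo rw,nosuid,nodev - fuse.lxcfs lxcfs rw
"""


def parse_mountinfo(text):
    """Parse mountinfo text into a list of dicts (one per mount)."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        left, _right = line.split(" - ", 1)
        fields = left.split()
        mounts.append({
            "id": int(fields[0]),
            "parent": int(fields[1]),
            "dev": fields[2],
            "root": fields[3],
            "mountpoint": fields[4],
        })
    return mounts


def covers_itself(child, parent):
    """True if `child` is a bind mount of the very directory it sits on."""
    if child["dev"] != parent["dev"]:
        return False  # different filesystem instance, so it hides something
    # Location of the covered directory relative to the parent's mountpoint...
    rel = posixpath.relpath(child["mountpoint"], parent["mountpoint"])
    # ...translated into a path inside the parent's filesystem.
    covered = posixpath.normpath(posixpath.join(parent["root"], rel))
    return child["root"] == covered


def obstructing_children(mounts, parent_id):
    """Mountpoints under `parent_id` that hide different content."""
    parent = next(m for m in mounts if m["id"] == parent_id)
    return [c["mountpoint"] for c in mounts
            if c["parent"] == parent_id and not covers_itself(c, parent)]


if __name__ == "__main__":
    print(obstructing_children(parse_mountinfo(SAMPLE), 21))
```

In the sample, the /proc/sys self bind is ignored and only the lxcfs file over /proc/meminfo is reported. Note that this speaks only to the hiding concern. The read-only flag on such a self bind would still be lost if a fresh proc could be mounted, which is the /proc/sys objection quoted above.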

* Re: user namespace and fully visible proc and sys mounts
From: Andy Lutomirski @ 2016-03-07 2:24 UTC
To: Eric W. Biederman
Cc: Serge E. Hallyn, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
>
> > Hi,
> >
> > So we've been over this many times... but unfortunately there is more
> > breakage to report. Regular privileged and unprivileged containers
> > work all right for us. But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace) then moves
> > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started. If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security? It
> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.
>
> If none of the mounts are for security the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.

Can we use the big hammer approach on /proc/sys? Specifically, what
if we made it so that /proc mounts created in a non-root namespace
*only* see things that are scoped to the active namespaces, and only
those over which the mounter has capabilities? We could have mount
options for this.

/proc/sys utterly sucks for namespaced things. So does the uid_map
and similar crap. The API is simply awful.

On a related note, can we *please* find a way to constrain namespace
creation in a way that might satisfy the RHEL crowd?

>
> That said /proc/sys appears to be a show stopper in this scheme. As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.
>
> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes. Call it processfs. That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.

Lovely. Yet another vaguely-namespaced thing in a pseudo-filesystem.

--Andy

* Re: user namespace and fully visible proc and sys mounts
From: Serge E. Hallyn @ 2016-03-07 3:45 UTC
To: Andy Lutomirski
Cc: Eric W. Biederman, Serge E. Hallyn, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >
> > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> >
> > > Hi,
> > >
> > > So we've been over this many times... but unfortunately there is more
> > > breakage to report. Regular privileged and unprivileged containers
> > > work all right for us. But running an unprivileged container inside a
> > > privileged container is blocked.
> > >
> > > When creating privileged containers, lxc by default does a few things:
> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > > (because this container is not in a user namespace) then moves
> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > > /sys/devices/virtual/net as writeable.
> > >
> > > If any of these are left enabled, unprivileged containers can't be
> > > started. If all are disabled, then they can be.
> > >
> > > Can we find a way to make these not block remounts in child user
> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
> >
> > Are any of these overmounts done for the purpose of security? It
> > appears the /proc/sys and /sys mounts being made read-only is for that
> > purpose.
> >
> > If none of the mounts are for security the easy solution that works
> > today is to also mount /proc and /sys somewhere else in your container
> > so that the permission check for mounting a new copy passes.
>
> Can we use the big hammer approach on /proc/sys? Specifically, what
> if we made it so that /proc mounts created in a non-root namespace
> *only* see things that are scoped to the active namespaces, and only
> those over which the mounter has capabilities? We could have mount
> options for this.

Of course the problem is precisely non-user-namespaced containers which
do own and have capabilities over the /proc/sys files. For user-namespaced
containers /proc/sys/ isn't really an issue.

Better namespacing of sysctls and maybe some way to say "I relinquish the
ability to update *those* sysctls for myself and all children" could help.

> /proc/sys utterly sucks for namespaced things. So does the uid_map
> and similar crap. The API is simply awful.
>
> On a related note, can we *please* find a way to constrain namespace
> creation in a way that might satisfy the RHEL crowd?
>
> >
> > That said /proc/sys appears to be a show stopper in this scheme. As the
> > root of your privileged container can enter your unprivileged container
> > it can bypass your read-only /proc/sys by mounting a new copy of proc if
> > we allow the relaxation you are requesting.
> >
> > Therefore the only choice on the table (and I don't have a clue how
> > realistic it is) is to have a variant of proc with just files describing
> > processes. Call it processfs. That would not need the current
> > restrictions.
> >
> > As for sysfs I am drawing a blank about what might be possible.
>
> Lovely. Yet another vaguely-namespaced thing in a pseudo-filesystem.
>
> --Andy

* Re: user namespace and fully visible proc and sys mounts
From: Andy Lutomirski @ 2016-03-07 3:49 UTC
To: Serge E. Hallyn
Cc: Eric W. Biederman, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
>> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> >
>> > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
>> >
>> > > Hi,
>> > >
>> > > So we've been over this many times... but unfortunately there is more
>> > > breakage to report. Regular privileged and unprivileged containers
>> > > work all right for us. But running an unprivileged container inside a
>> > > privileged container is blocked.
>> > >
>> > > When creating privileged containers, lxc by default does a few things:
>> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
>> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
>> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
>> > > (because this container is not in a user namespace) then moves
>> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
>> > > /sys/devices/virtual/net as writeable.
>> > >
>> > > If any of these are left enabled, unprivileged containers can't be
>> > > started. If all are disabled, then they can be.
>> > >
>> > > Can we find a way to make these not block remounts in child user
>> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>> >
>> > Are any of these overmounts done for the purpose of security? It
>> > appears the /proc/sys and /sys mounts being made read-only is for that
>> > purpose.
>> >
>> > If none of the mounts are for security the easy solution that works
>> > today is to also mount /proc and /sys somewhere else in your container
>> > so that the permission check for mounting a new copy passes.
>>
>> Can we use the big hammer approach on /proc/sys? Specifically, what
>> if we made it so that /proc mounts created in a non-root namespace
>> *only* see things that are scoped to the active namespaces, and only
>> those over which the mounter has capabilities? We could have mount
>> options for this.
>
> Of course the problem is precisely non-user-namespaced containers which
> do own and have capabilities over the /proc/sys files. For user-namespaced
> containers /proc/sys/ isn't really an issue.

What I mean is:

    mount -o nsonly=user,net -t proc none /proc

would show the list of processes and things scoped to the current
userns and netns, would *not* show global sysctls, and would fail
unless the caller has appropriate caps over the userns and netns.
This would work even if the old procfs is not fully visible.

--Andy

* Re: user namespace and fully visible proc and sys mounts
From: Serge E. Hallyn @ 2016-03-07 5:03 UTC
To: Andy Lutomirski
Cc: Serge E. Hallyn, Eric W. Biederman, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

On Sun, Mar 06, 2016 at 07:49:14PM -0800, Andy Lutomirski wrote:
> On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
> >> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> >
> >> > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> >> >
> >> > > Hi,
> >> > >
> >> > > So we've been over this many times... but unfortunately there is more
> >> > > breakage to report. Regular privileged and unprivileged containers
> >> > > work all right for us. But running an unprivileged container inside a
> >> > > privileged container is blocked.
> >> > >
> >> > > When creating privileged containers, lxc by default does a few things:
> >> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> >> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> >> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> >> > > (because this container is not in a user namespace) then moves
> >> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> >> > > /sys/devices/virtual/net as writeable.
> >> > >
> >> > > If any of these are left enabled, unprivileged containers can't be
> >> > > started. If all are disabled, then they can be.
> >> > >
> >> > > Can we find a way to make these not block remounts in child user
> >> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
> >> >
> >> > Are any of these overmounts done for the purpose of security? It
> >> > appears the /proc/sys and /sys mounts being made read-only is for that
> >> > purpose.
> >> >
> >> > If none of the mounts are for security the easy solution that works
> >> > today is to also mount /proc and /sys somewhere else in your container
> >> > so that the permission check for mounting a new copy passes.
> >>
> >> Can we use the big hammer approach on /proc/sys? Specifically, what
> >> if we made it so that /proc mounts created in a non-root namespace
> >> *only* see things that are scoped to the active namespaces, and only
> >> those over which the mounter has capabilities? We could have mount
> >> options for this.
> >
> > Of course the problem is precisely non-user-namespaced containers which
> > do own and have capabilities over the /proc/sys files. For user-namespaced
> > containers /proc/sys/ isn't really an issue.
>
> What I mean is:
>
>     mount -o nsonly=user,net -t proc none /proc
>
> would show the list of processes and things scoped to the current
> userns and netns, would *not* show global sysctls, and would fail
> unless the caller has appropriate caps over the userns and netns.
> This would work even if the old procfs is not fully visible.

Gah, so apparently I'd forgotten the workaround I'd implemented - I
thought things had regressed, but they haven't, I'd just missed a step.
Sorry for the noise.

I don't want to make things more complicated or more brittle when we
can make it work as is - thanks.

-serge

* Re: user namespace and fully visible proc and sys mounts
From: Eric W. Biederman @ 2016-03-08 0:07 UTC
To: Andy Lutomirski
Cc: Serge E. Hallyn, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

Andy Lutomirski <luto@amacapital.net> writes:

> On a related note, can we *please* find a way to constrain namespace
> creation in a way that might satisfy the RHEL crowd?

I am not certain to what you are referring.

As long as folks are willing to work with me I am happy to help design
something that makes things better for everyone. If someone
pushes hard, suggests crappy patches, and does not listen to
constructive feedback I will shoot their patches down (especially when I
am sick and tired as I have been more than I would like this development
cycle).

Eric

* Re: user namespace and fully visible proc and sys mounts
From: Andy Lutomirski @ 2016-03-08 0:24 UTC
To: Eric W. Biederman
Cc: Serge E. Hallyn, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

On Mon, Mar 7, 2016 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> On a related note, can we *please* find a way to constrain namespace
>> creation in a way that might satisfy the RHEL crowd?
>
> I am not certain to what you are referring.
>
> As long as folks are willing to work with me I am happy to help design
> something that makes things better for everyone. If someone
> pushes hard, suggests crappy patches, and does not listen to
> constructive feedback I will shoot their patches down (especially when I
> am sick and tired as I have been more than I would like this development
> cycle).

I think we should add some mechanism that will allow the right to
create various namespaces to be constrained in a useful and usable
manner. I'll start a new thread.

* Re: user namespace and fully visible proc and sys mounts
From: Eric W. Biederman @ 2016-03-08 4:05 UTC
To: Andy Lutomirski
Cc: Serge E. Hallyn, Serge Hallyn, Seth Forshee, lkml, Stéphane Graber

Andy Lutomirski <luto@amacapital.net> writes:

> On Mon, Mar 7, 2016 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>
>>> On a related note, can we *please* find a way to constrain namespace
>>> creation in a way that might satisfy the RHEL crowd?
>>
>> I am not certain to what you are referring.
>>
>> As long as folks are willing to work with me I am happy to help design
>> something that makes things better for everyone. If someone
>> pushes hard, suggests crappy patches, and does not listen to
>> constructive feedback I will shoot their patches down (especially when I
>> am sick and tired as I have been more than I would like this development
>> cycle).
>
> I think we should add some mechanism that will allow the right to
> create various namespaces to be constrained in a useful and usable
> manner. I'll start a new thread.

On the general principle that there is more attack surface, and attack
surface reduction is generally good I agree.

I will await your follow on thread when you are ready.

Eric