From mboxrd@z Thu Jan 1 00:00:00 1970 From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) Subject: Re: [RFC][PATCH 0/9] Make containers kernel objects Date: Tue, 23 May 2017 07:54:02 -0500 Message-ID: <874lwbraxh.fsf@xmission.com> References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk> <87lgpoww67.fsf@xmission.com> <1495491733.25946.3.camel@redhat.com> Mime-Version: 1.0 Return-path: In-Reply-To: <1495491733.25946.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> (Jeff Layton's message of "Mon, 22 May 2017 18:22:13 -0400") Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Jeff Layton Cc: David Howells , trondmy-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org, mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Jeff Layton writes: > On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote: >> David Howells writes: >> >> > Here are a set of patches to define a container object for the kernel and >> > to provide some methods to create and manipulate them. >> > >> > The reason I think this is necessary is that the kernel has no idea how to >> > direct upcalls to what userspace considers to be a container - current >> > Linux practice appears to make a "container" just an arbitrarily chosen >> > junction of namespaces, control groups and files, which may be changed >> > individually within the "container". >> > >> >> I think this might possibly be a useful abstraction for solving the >> keyring upcalls if it was something created implicitly. >> >> fork_into_container for use by keyring upcalls is currently a security >> vulnerability as it allows escaping all of a containers cgroups. But >> you have that on your list of things to fix. However you don't have >> seccomp and a few other things. >> >> Before we had kthreadd in the kernel upcalls always had issues because >> the code to reset all of the userspace bits and make the forked >> task suitable for running upcalls was always missing some detail. It is >> a very bug-prone kind of idiom that you are talking about. It is doubly >> bug-prone because the wrongness is visible to userspace and as such >> might get become a frozen KABI guarantee. >> >> Let me suggest a concrete alternative: >> >> - At the time of mount observer the mounters user namespace. >> - Find the mounters pid namespace. >> - If the mounters pid namespace is owned by the mounters user namespace >> walk up the pid namespace tree to the first pid namespace owned by >> that user namespace. >> - If the mounters pid namespace is not owned by the mounters user >> namespace fail the mount it is going to need to make upcalls as >> will not be possible. >> - Hold a reference to the pid namespace that was found. >> >> Then when an upcall needs to be made fork a child of the init process >> of the specified pid namespace. Or fail if the init process of the >> pid namespace has died. >> >> That should always work and it does not require keeping expensive state >> where we did not have it previously. Further because the semantics are >> fork a child of a particular pid namespace's init as features get added >> to the kernel this code remains well defined. >> >> For ordinary request-key upcalls we should be able to use the same rules >> and just not save/restore things in the kernel. >> > > OK, that does seem like a reasonable idea. Note that it's not just > request-key upcalls here that we're interested in, but anything that > we'd typically spawn from kthreadd otherwise. General user mode helper *Nod*. > That said, I worry a little about this. If the init process does a setns > at the wrong time, suddenly you're doing the upcall in different > namespaces than you intended. > > Might it be better to use the init process of the container as the > template like you suggest, but snapshot its "context" at a particular > point in time instead? > > knfsd could do this when it's started, for instance... The danger of a snapshot it time is something important (like cgroup membership) might change. It might be necessary to have this be an opt-in. Perhaps even to the point of starting a dedicated kthreadd. Right now I think we need to figure out what it will take to solve this in the kernel because I strongly suspect that solving this in userspace is a cop out and we really aren't providing enough information to userspace to run the helper in the proper context. And I strongly suspect that providing enough information from the kernel will be roughly equivalent to solving this in the kernel. The only big issue I have had with the suggestion of a dedicated thread in the past is the overhead something like that will breing with it. Eric