Linux userland API discussions

* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Andy Lutomirski @ 2019-04-15 23:58 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Andy Lutomirski, Aleksa Sarai, Enrico Weigelt, metux IT consult,
	Christian Brauner, Linus Torvalds, Al Viro, Jann Horn,
	David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
	Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel
In-Reply-To: <CAGLj2rEOfW=3wis3YWomWK1AVDdCX6DMqbvSHXba3-x=kpkcNA@mail.gmail.com>

On Mon, Apr 15, 2019 at 2:26 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski <luto@kernel.org> wrote:
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes.  And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit.  I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable.  I suppose it would be nifty if there were a way for a
>
> Hmm, isn't what you're describing roughly what systemd-run -t does? It
> will serialize the argument list, ask PID 1 to create a transient unit
> (go through the polkit stuff), and then set the stdout/stderr and
> stdin of the service to your tty, make it the controlling terminal of
> the process and
> reset it. So I guess it should work with sudo/su just fine too.
>
> There is also s6-sudod (and a s6-sudoc client to it) that works in a
> similar fashion, though it's a lot less fancy.

Cute.  Now we just distros to work out the kinks and to ship these as
sudo and su :)

>
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess.  Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec().  It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs".  This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
>
> My idea was that this intent could be supplied at clone time, you
> could attach ptrace access modes to a pidfd (we could make those a bit
> granular, perhaps) and any API that takes PIDs and checks against the
> caller's ptrace access mode could instead derive so from the pidfd.
> Since killing is a bit convoluted due to setuid binaries, that should
> work if one is CAP_KILL capable in the owning userns of the task, and
> if not that, has permissions to kill and the target has NNP set.

This CAP_KILL trick makes me nervous.  This particular permission is
really quite powerful, and it would need some analysis to conclude
that it's not *more* powerful than CAP_KILL.

> This
> would allow you to bind kill privileges in a way that is compatible
> with both worlds, the upshot being NNP allows for the functionality to
> be available to a lot more of userspace. Ofcourse, this would require
> a new clone version, possibly with taking a clone2 struct which sets a
> few parameters for the process and the flags for the pidfd.
>
> Another point is that you have a pidfd_open (or something else) that
> can create multiple pidfds from a pidfd obtained at clone time and
> create pidfds with varying level of rights. It can also work by taking
> a TID to open a pidfd for an external task (and then for all the
> rights you wish to acquire on it, check against your ambient
> authority).

Indeed.

>
> (Actually, in general, having FMODE_* style bits spanning all methods
> a file descriptor can take (through system calls), with the type of
> object as key (class containing a set), and be able to enable/disable
> them and seal them would be a useful addition, this all happening at
> the struct file level instead of inode level sealing in memfds).

At the risk of saying a dirty word, the Windows API works quite a bit
like this :)

^ permalink raw reply