From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jim Lieb <jlieb@panasas.com>
Subject: Re: Re: Re: Re: [PATCH 1/3] switch_creds: Syscall to switch creds for file server ops
Date: Thu, 31 Oct 2013 13:39:50 -0700
Message-ID: <1397707.n4PsIJcfRu@jlieb-e6410>
References: <1381960919-4542-1-git-send-email-jlieb@panasas.com> <3154208.hemKm7CgkG@jlieb-e6410> <CALCETrVaT7EaPOs7w=i_WJQS9mNEXK1-GvPni2jJ29Ne+2X8gA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	<bfields@redhat.com>, Jeff Layton <jlayton@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <CALCETrVaT7EaPOs7w=i_WJQS9mNEXK1-GvPni2jJ29Ne+2X8gA@mail.gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Thursday, October 31, 2013 12:48:54 Andy Lutomirski wrote:
> On Thu, Oct 31, 2013 at 12:43 PM, Jim Lieb <jlieb@panasas.com> wrote:
> > On Thursday, October 31, 2013 12:09:08 Andy Lutomirski wrote:
> >> On Thu, Oct 24, 2013 at 1:24 PM, Jim Lieb <jlieb@panasas.com> wrot=
e:
> >> > On Thursday, October 24, 2013 12:28:15 Andy Lutomirski wrote:
> >> >> On Wed, Oct 23, 2013 at 10:59 PM, Eric W. Biederman
> >> >>=20
> >> >> <ebiederm@xmission.com> wrote:
> >> >> > Andy Lutomirski <luto@amacapital.net> writes:
> >> >> >> On 10/16/2013 08:52 PM, Eric W. Biederman wrote:
> >> >> >>> Al Viro <viro@ZenIV.linux.org.uk> writes:
> >> >> >>>> On Wed, Oct 16, 2013 at 06:18:16PM -0700, Eric W. Biederma=
n=20
wrote:
> >> >> >>>>> That doesn't look bad but it does need capable(CAP_SETUID=
) &&
> >> >> >>>>> capable(CAP_SETGID) or possibly something a little more r=
efined.
> >> >> >>>>=20
> >> >> >>>> D'oh
> >> >> >>>>=20
> >> >> >>>>> I don't think we want file descriptor passing to all of a=
 sudden
> >> >> >>>>> become
> >> >> >>>>> a grant of privilege, beyond what the passed fd can do.
> >> >> >>>>=20
> >> >> >>>> Definitely.  And an extra ) to make it compile wouldn't hu=
rt
> >> >> >>>> either...
> >> >> >>>=20
> >> >> >>> There also appears to need to be a check that we don't gain=
 any
> >> >> >>> capabilities.
> >> >> >>>=20
> >> >> >>> We also need a check so that you don't gain any capabilitie=
s, and
> >> >> >>> possibly a few other things.
> >> >> >>=20
> >> >> >> Why?  I like the user_ns part, but I'm not immediately seein=
g the
> >> >> >> issue
> >> >> >> with capabilities.
> >> >> >=20
> >> >> > My reasoning was instead of making this syscall as generic as
> >> >> > possible
> >> >> > start it out by only allowing the cases Jim cares about and w=
orking
> >> >> > with
> >> >> > a model where you can't gain any permissions you couldn't gai=
n
> >> >> > otherwise.
> >> >> >=20
> >> >> > Although the fd -1 trick to revert to your other existing cre=
d seems
> >> >> > reasonable.
> >> >> >=20
> >> >> >>> So I suspect we want a check something like:
> >> >> >>>=20
> >> >> >>> if ((new_cred->securebits !=3D current_cred->securebits)  |=
|
> >> >> >>>=20
> >> >> >>>     (new_cred->cap_inheritable !=3D current_cred->cap_inher=
itable)
> >> >> >>>     ||
> >> >> >>>     (new_cred->cap_permitted !=3D current_cred->cap_permitt=
ed) ||
> >> >> >>>     (new_cred->cap_effective !=3D current_cred->cap_effecti=
ve) ||
> >> >> >>>     (new_cred->cap_bset !=3D current_cred->cap_bset) ||
> >> >> >>>     (new_cred->jit_keyring !=3D current_cred->jit_keyring) =
||
> >> >> >>>     (new_cred->session_keyring !=3D current_cred->session_k=
eyring)
> >> >> >>>     ||
> >> >> >>>     (new_cred->process_keyring !=3D current_cred->process_k=
eyring)
> >> >> >>>     ||
> >> >> >>>     (new_cred->thread_keyring !=3D current_cred->thread_key=
ring) ||
> >> >> >>>     (new_cred->request_keyring !=3D current_cred->request_k=
eyring)
> >> >> >>>     ||
> >> >> >>>     (new_cred->security !=3D current_cred->security) ||
> >> >> >>>     (new_cred->user_ns !=3D current_cred->user_ns)) {
> >> >> >>>    =20
> >> >> >>>      return -EPERM;
> >> >> >>>=20
> >> >> >>> }
> >> >> >>=20
> >> >> >> I *really* don't like the idea of being able to use any old =
file
> >> >> >> descriptor.  I barely care what rights the caller needs to h=
ave to
> >> >> >> invoke this -- if you're going to pass an fd that grants a
> >> >> >> capability
> >> >> >> (in the non-Linux sense of the work), please make sure that =
the
> >> >> >> sender
> >> >> >> actually wants that behavior.
> >> >> >>=20
> >> >> >> IOW, have a syscall to generate a special fd for this purpos=
e.=20
> >> >> >> It's
> >> >> >> only a couple lines of code, and I think we'll really regret=
 it if
> >> >> >> we
> >> >> >> fsck this up.
> >> >> >>=20
> >> >> >> (I will take it as a personal challenge to find at least one
> >> >> >> exploitable
> >> >> >> privilege escalation in this if an arbitrary fd works.)
> >> >> >=20
> >> >> > If you can't switch to a uid or a gid you couldn't switch to
> >> >> > otherwise
> >> >> > then the worst that can happen is an information leak.  And
> >> >> > information
> >> >> > leaks are rarely directly exploitable.
> >> >>=20
> >> >> Here's the attack:
> >> >>=20
> >> >> Suppose there's a daemon that uses this in conjunction with
> >> >> SCM_RIGHTS.  The daemon is effectively root (under the current
> >> >> proposal, it has to be).  So a client connects, sends a credent=
ial fd,
> >> >> and the daemon impersonates those credentials.
> >> >>=20
> >> >> Now a malicious user obtains *any* fd opened by root.  It sends=
 that
> >> >> fd to the daemon.  The daemon then impersonates root.  We lose.=
  (It
> >> >> can't possibly be hard to obtain an fd with highly privileged f=
_cred
> >> >> -- I bet that most console programs have stdin like that, for e=
xample.
> >> >>=20
> >> >>  There are probably many setuid programs that will happily open
> >> >>=20
> >> >> /dev/null for you, too.)
> >> >=20
> >> > In my reply to Eric, I note that I need to add a check that the =
fd
> >> > passed
> >> > is one from switch_creds. With that test, not any fd will do and=
 the
> >> > one
> >> > that does has only been able to set fsuid, fsgid, altgroups, and
> >> > reduced
> >> > (the nfsd set) caps.  They can do no more damage than what the o=
riginal
> >> > switch_creds allowed.  The any fd by root no longer applies so u=
se
> >> > doesn't get much (no escalation).
> >> >=20
> >> >> >> Also... real_cred looks confusing.  AFAICS it is used *only*=
 for
> >> >> >> knfsd
> >> >> >> and faccessat.  That is, current userspace can't see it.  Bu=
t now
> >> >> >> you'll
> >> >> >> expose various oddities.  For example, AFAICS a capability-l=
ess
> >> >> >> process
> >> >> >> that's a userns owner can always use setuid.  This will *ove=
rwrite*
> >> >> >> real_cred.  Then you're screwed, especially if this happens =
by
> >> >> >> accident.
> >> >> >=20
> >> >> > And doing in userland what faccessat, and knfsd do in the ker=
nel is
> >> >> > exactly what is desired here.  But maybe there are issues wit=
h that.
> >> >> >=20
> >> >> >> That being said, Windows has had functions like this for a l=
ong
> >> >> >> time.
> >> >> >> Processes have a primary token and possibly an impersonation=
 token.
> >> >> >> Any
> >> >> >> process can call ImpersonateLoggedOnUser (no privilege requi=
red) to
> >> >> >> impersonate the credentials of a token (which is special kin=
d of
> >> >> >> fd).
> >> >> >> Similarly, any process can call RevertToSelf to undo it.
> >> >> >>=20
> >> >> >> Is there any actual problem with allowing completely unprivi=
leged
> >> >> >> tasks
> >> >> >> to switch to one of these magic cred fds?  That would avoid =
needing
> >> >> >> a
> >> >> >> "revert" operation.
> >> >> >=20
> >> >> > If the permission model is this switching of credentials does=
n't get
> >> >> > you
> >> >> > anything you couldn't get some other way.  That would seem to
> >> >> > totally
> >> >> > rules out unprivileged processes switching these things.
> >> >>=20
> >> >> IMO, there are two reasonable models that involve fds carrying =
some
> >> >> kind of credential.
> >> >>=20
> >> >> 1. The fd actually carries the ability to use the credentials. =
 You
> >> >> need to be very careful whom you send these to.  The security
> >> >> implications are obvious (which is good) and the receiver doesn=
't need
> >> >> privilege (which is good, as long as the receiver is careful).
> >> >=20
> >> > I test for caps.  I think using switch_creds in all forms should
> >> > require
> >> > privs.  I thought of the revert case and although it does simply=
 return
> >> > to
> >> > "real" creds, you still need CAP_SETUID and CAP_SETGID to do it.=
  This
> >> > means, of course, if you have really messed up things, you may n=
ot be
> >> > able to get back home which, although a bad thing for the proces=
s, is a
> >> > good thing for the system as a whole.
> >> >=20
> >> >> 2. The fd is for identification only.  But this means that the =
fd
> >> >> carries the ability to identify as a user.  So you *still* have=
 to be
> >> >> very careful about whom you send it to.  What you need is an op=
eration
> >> >> that allows you to identify using the fd without transitively g=
ranting
> >> >> the recipient the same ability.  On networks, this is done by s=
igning
> >> >> some kind of challenge.  The kernel could work the same way, or=
 there
> >> >> could be a new CMSG_IDENTITY that you need an identity fd to se=
nd but
> >> >> that does not copy that fd to the recipient.
> >> >=20
> >> > I am not sure I understand this.  CMSG only applies to UNIX_DOMA=
IN
> >> > sockets
> >> > which means that the switch_creds fd test still applies here.  I=
t is
> >> > identification but only for within the same kernel.  As for name=
spaces,
> >> > the
> >> > translation was done when the creds fd was created.  I suppose i=
f it
> >> > was
> >> > passed across namespace boundaries there could be a problem but =
what is
> >> > in
> >> > the creds structure is the translated fduid,fsgid etc., not the
> >> > untranslated one. We are still doing access checks and quota stu=
ff with
> >> > the translated creds.  If one namespace has "fred" as uid 52 and
> >> > another
> >> > has "mary" as 52, the quota will only be credited to whoever "fr=
ed"
> >> > really is only if "fred" gets to be "mary" in the second alterna=
te
> >> > universe...
> >>=20
> >> I'm not talking about namespaces here -- I think we're not really =
on
> >> the same page.  Can you describe what you're envisioning doing wit=
h
> >> these fds?
> >=20
> > I have a new version that is now two syscalls and no multiplexing h=
as more
> > testing etc. to mirror exactly the tests in setfsuid().  I still am
> > testing
> > but plan to send this new patchset next week for review.
> >=20
> > Ok,  I may have missed something.  What I meant to say is that I'm =
using
> > the same namespace functions for switch_creds as all the set*uid sy=
scalls
> > use so what I have should work as well.  If one were to pass this f=
d via
> > CMSG, it could cross a namespace boundary.  The changed mappings of=
 this
> > fd would be the same as for any other fd passed across now.  Maybe =
that
> > is irrelevant in this case.
> >=20
> > The purpose of the fd is to create a handle to a creds set that a s=
erver
> > can then efficiently switch to prior to filesystem syscalls where f=
suid
> > etc. are relevant (mkdir, creat, write, etc.).  This is exposing th=
e
> > override_creds() and revert_creds() to these servers.  I do not int=
end in
> > our server to hand these fds to anyone else.  After all, they are u=
seless
> > for anything other than switch_creds/use_creds.
> >=20
> > The server keeps a cache of its known, currently active client cred=
s along
> > with this fd.  The fast path is to lookup the creds and use the fd.=
  On
> > cache misses, it does a switch_creds() and saves the fd in the cach=
e.
>=20
> So if I understand you correctly, you're not planning on sending this
> fd between process at all.  In that case, adding a new API that seems
> designed for interprocess use and having exactly zero usecases to
> think about makes me nervous.  Why is it going to use fds again?

Correct.  They are handles to a set of creds for filesystem calls.  I u=
se fds=20
because every filp has a pointer to a set of creds that gets swapped in=
 for=20
various things like coredumps, and in the case of knfsd, impersonating =
the nfs=20
client for when it does filesystem morphing ops.  In order to currently=
 do what=20
nfsd_setuser() does we must do:

	setfsuid();setfsgid(); setgroups(); setcaps();
followed by=20
	open/creat/mknod/write
followed by
	 setfsuid(); setfsgid(); setgroups(); setcaps();

Each one of these 4 syscalls discards an RCU copy of the creds, both go=
ing in=20
and going out.  In the first case below, the revert doesn't recycle the=
 RCU=20
because the open fd still holds a ref.  Same applied to the second case=
=2E  In=20
fact here, no new creds are created either.  The creds only get recycle=
d when=20
the fd is closed.

Using an fd to have a handle on what all these do is the simplest thing=
 to=20
pass back and forth between user and kernel space.  This is what Ted T'=
so and=20
Al Viro suggested.  This ends up looking like:

	fd =3D switch_creds(creds struct);
followed by
	open/creat/mknod/write
followed by
	use_creds(-1);

The -1 arg is used as I proposed in the earlier round, namely, it tells=
=20
use_creds to revert to real creds.

Subsequent uses look like:

	use_creds(cached fd);
followed by
	open/creat/mknod/write
followed by
	use_creds(-1);


There is no need to pass the fd because if the other side has the caps =
to use=20
the fd, they have the caps to get their own.  We are saving overhead.
>=20
> Or maybe I should wait to see the updated API before complaining :)
>=20
> >> >> >> Another note: I think that there may be issues if the creato=
r of a
> >> >> >> token
> >> >> >> has no_new_privs set and the user doesn't.  Imagine a daemon=
 that
> >> >> >> accepts one of these fds, impersonates it, and calls exec.  =
This
> >> >> >> could
> >> >> >> be used to escape from no_new_privs land.
> >> >> >=20
> >> >> > Which is why I was suggesting that we don't allow changing an=
y field
> >> >> > in
> >> >> > the cred except for uids and gids.
> >> >>=20
> >> >> If the daemon impersonates and execs, we still lose.
> >> >=20
> >> > I answered this.  You only get to impersonate for purposes of fi=
le
> >> > access
> >> > with reduced caps to prevent you from being root as well.  Also,=
 since
> >> > these are O_CLOEXEC and switch_creds "magic" fds, this can't hap=
pen
> >> > because the fd is gone post-exec.
> >>=20
> >> If a no_new_privs process authenticates to a daemon and that daemo=
n
> >> execs, then there's now a dumpable, ptraceable, non-no-new-privs
> >> process running as the uid/gid of the no_new_privs process.  This =
may
> >> be dangerous.
> >=20
> > Two things.  My new patch set now throws an error if the fd is not =
one
> > returned by switch_creds().  The new syscall here is use_creds() bt=
w.=20
> > That fd is opened O_CLOEXEC so the syscall has two guards.  First, =
it can
> > only be one that switch_creds() returned and second, it gets closed=
 on an
> > exec so the no_new_privs process doesn't get it.  In addition,
> > switch_creds() reduces the effective caps set to the same minimal s=
et
> > that nfsd_setuser() has.
> >=20
> > If someone else chooses to pass this fd via CMSG, whoever gets it h=
as
> > reduced privs and in order to use it for anything, i.e. use_creds()=
, it
> > still needs to inherit SETUID and SETGID caps from its parent or it=
 will
> > get an EPERM error. Passing this in "friendly" code defeats the pur=
pose
> > of the cache above in that we could get fd leaks.  If there is anot=
her
> > flag that could prevent its being passed along via CMSG, I can add =
that
> > too because using CMSG is well outside the use case.
>=20
> I still don't see the point of lowering effective caps, but this is
> IMO minor.  Are you checking the permitted set on revert?

We lower specific filesystem related caps to prevent using the root byp=
ass of=20
acess checks, prevent root bypass of quota restrictions, and to correct=
ly=20
charge the impersonated user for quota'd resource usage.  This is what=20
nfsd_setuser() does.  I use the same mask set.

I don't check the permitted set associated with the fd because I've alr=
eady=20
checked that it is a switch_creds() returned fd (they are immutable).  =
I do=20
make a check at the beginning of both syscalls for the caps to set uid/=
gid,=20
the same test as setfsuid et al.

>=20
> --Andy

--=20
Jim Lieb
Linux Systems Engineer
Panasas Inc.

"If ease of use was the only requirement, we would all be riding tricyc=
les"
- Douglas Engelbart 1925=E2=80=932013