From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jim Lieb Subject: Re: Re: [PATCH 1/3] switch_creds: Syscall to switch creds for file server ops Date: Thu, 24 Oct 2013 12:04:38 -0700 Message-ID: <2143620.4IRQt66LHH@jlieb-e6410> References: <1381960919-4542-1-git-send-email-jlieb@panasas.com> <52687468.3060206@mit.edu> <8761snb44l.fsf@xmission.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Andy Lutomirski , Al Viro , , , , , , To: "Eric W. Biederman" Return-path: Received: from static-209-166-131-148.expedient.com ([209.166.131.148]:59180 "EHLO natasha.panasas.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1755722Ab3JXUOC convert rfc822-to-8bit (ORCPT ); Thu, 24 Oct 2013 16:14:02 -0400 In-Reply-To: <8761snb44l.fsf@xmission.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Wednesday, October 23, 2013 22:59:54 Eric W. Biederman wrote: > Andy Lutomirski writes: > > On 10/16/2013 08:52 PM, Eric W. Biederman wrote: > >> Al Viro writes: > >>> On Wed, Oct 16, 2013 at 06:18:16PM -0700, Eric W. Biederman wrote= : > >>>> That doesn't look bad but it does need capable(CAP_SETUID) && > >>>> capable(CAP_SETGID) or possibly something a little more refined. > >>>=20 > >>> D'oh > >>>=20 > >>>> I don't think we want file descriptor passing to all of a sudden= become > >>>> a grant of privilege, beyond what the passed fd can do. > >>>=20 > >>> Definitely. And an extra ) to make it compile wouldn't hurt eith= er... > >>=20 > >> There also appears to need to be a check that we don't gain any > >> capabilities. > >>=20 > >> We also need a check so that you don't gain any capabilities, and > >> possibly a few other things. My original api does not pass any caps however, the code in setting up = the new=20 creds does mask off a number of them in the effective set, the same set= that=20 knfsd does. So all I can do is reduce, not extend. We need to reduce = caps=20 because without that, quotas and access get bypassed. If I do the gene= ric (4=20 syscalls), this opens up lots of options for setting things which is no= t my=20 need or intent. > >=20 > > Why? I like the user_ns part, but I'm not immediately seeing the i= ssue > > with capabilities. As for name spaces, we don't have much interest here for two reasons. 1. if the server does run under a different namespace, so be it. The sy= scall=20 does make those translations. If you want to firewall off the service = and its=20 exports, fine. 2. in our main use case, nfs-ganesha is running as a pNFS metadata serv= er and=20 its uid namespace comes from somewhere else (krb5 etc.) anyway. The ke= rnel=20 etc. it is running on is completely separate, a black box with its own=20 nsswitch options, probably local only with only a small set of admin en= tries. Just to be clear, if there is any mapping between uids within the ganes= ha env=20 and the local env (an nfs server that is also an app server), all of th= e above=20 applies assuming the nsswitch/idmapper/ipa is set up properly. The how= that=20 happens is outside scope at this level. >=20 > My reasoning was instead of making this syscall as generic as possibl= e > start it out by only allowing the cases Jim cares about and working w= ith > a model where you can't gain any permissions you couldn't gain > otherwise. Right. That is my intent, to do the same as nfsd_setuser. We need to=20 impersonate the user at the client per the creds we get from the approp= riate=20 authentication service and to only do it for file access, nothing more.= The=20 api I chose allows later extension but only supports what we need now. I looked at the case of a setuid or seteuid with this and the use case = is=20 NFS/CIFS/9P/... user space file servers. That is why I originally chos= e an=20 evil multiplexed call. I could split it into two or three syscalls but= my=20 intent in either api is to have a restricted set of creds. There is a=20 (potential) new use case in NFSv4.2 labels where we will also have to s= hove=20 those down as well, adding another variant. We've got to pass them in = somehow=20 and the set*uid mess is an example of the other way. The setuid family= used=20 to be pretty simple (just two syscalls) but not anymore. At least this= is an=20 extensible model that is consistent and returns useful things like EPER= M=20 rather than an "alternate" uid/gid that must be interpreted to find the= error. I could also restructure the api to have a typed argument. One early p= ass had=20 a struct that had everything in it, the op, the fd, and the creds but t= hat has=20 the pitfall of now freezing the whole api at the syscall entry point. I= f v4.2=20 labels or CIFS SIDs must be passed in future, there would be no alterna= tive=20 but yet another syscall. I do validate the arg. The unsigned long do= es pass=20 32/64. The typedefs in the struct match what is used by the other sysc= alls. =20 Our server uses the same typedefs internally so the usual porting issue= s look=20 minimal (and the same as the setfsuid+setfsgid+setgroups it replaces). I was also trying to cut down on the number of RCU cleanups here. This= is=20 what nfsd_setuser already does. Caching via using an fd as a handle to= the=20 kernel resource is good and lightweight but our server under load will= have=20 to shed "creds" fds before I/O fds (for lock reasons) and that means th= e=20 nfsd_setuser method of setting fsuid, fsgid, altgroups, and restricted = caps=20 could be the common path under load. RCU is cheaper but it is not chea= p and=20 I'd really prefer to not have the fallback make matters worse. Our use case is a meta-data server part of pNFS server, in particular, = pNFS in=20 a clustered env. This means a higher metadata load where these switche= s have=20 real cost. >=20 > Although the fd -1 trick to revert to your other existing cred seems > reasonable. In my revised api idea, I would still need the revert especially with m= ultiple=20 set* calls that have failure paths. I may not know what was left so a = full=20 reset to real is the safest choice in the error path. It is also cheap= (er) in=20 the post- file operation return path. >=20 > >> So I suspect we want a check something like: > >>=20 > >> if ((new_cred->securebits !=3D current_cred->securebits) || > >>=20 > >> (new_cred->cap_inheritable !=3D current_cred->cap_inheritable)= || > >> (new_cred->cap_permitted !=3D current_cred->cap_permitted) || > >> (new_cred->cap_effective !=3D current_cred->cap_effective) || > >> (new_cred->cap_bset !=3D current_cred->cap_bset) || > >> (new_cred->jit_keyring !=3D current_cred->jit_keyring) || > >> (new_cred->session_keyring !=3D current_cred->session_keyring)= || > >> (new_cred->process_keyring !=3D current_cred->process_keyring)= || > >> (new_cred->thread_keyring !=3D current_cred->thread_keyring) |= | > >> (new_cred->request_keyring !=3D current_cred->request_keyring)= || > >> (new_cred->security !=3D current_cred->security) || > >> (new_cred->user_ns !=3D current_cred->user_ns)) { > >> =09 > >> return -EPERM; > >>=20 > >> } > >=20 > > I *really* don't like the idea of being able to use any old file > > descriptor. I barely care what rights the caller needs to have to > > invoke this -- if you're going to pass an fd that grants a capabili= ty > > (in the non-Linux sense of the work), please make sure that the sen= der > > actually wants that behavior. > >=20 > > IOW, have a syscall to generate a special fd for this purpose. It'= s > > only a couple lines of code, and I think we'll really regret it if = we > > fsck this up. > >=20 > > (I will take it as a personal challenge to find at least one exploi= table > > privilege escalation in this if an arbitrary fd works.) >=20 > If you can't switch to a uid or a gid you couldn't switch to otherwis= e > then the worst that can happen is an information leak. And informati= on > leaks are rarely directly exploitable. >=20 > > Also... real_cred looks confusing. AFAICS it is used *only* for kn= fsd > > and faccessat. That is, current userspace can't see it. But now y= ou'll > > expose various oddities. For example, AFAICS a capability-less pro= cess > > that's a userns owner can always use setuid. This will *overwrite* > > real_cred. Then you're screwed, especially if this happens by > > accident. >=20 > And doing in userland what faccessat, and knfsd do in the kernel is > exactly what is desired here. But maybe there are issues with that. >=20 > > That being said, Windows has had functions like this for a long tim= e. > > Processes have a primary token and possibly an impersonation token.= Any > > process can call ImpersonateLoggedOnUser (no privilege required) to > > impersonate the credentials of a token (which is special kind of fd= ). > > Similarly, any process can call RevertToSelf to undo it. > >=20 > > Is there any actual problem with allowing completely unprivileged t= asks > > to switch to one of these magic cred fds? That would avoid needing= a > > "revert" operation. >=20 > If the permission model is this switching of credentials doesn't get = you > anything you couldn't get some other way. That would seem to totally > rules out unprivileged processes switching these things. >=20 > > Another note: I think that there may be issues if the creator of a = token > > has no_new_privs set and the user doesn't. Imagine a daemon that > > accepts one of these fds, impersonates it, and calls exec. This co= uld > > be used to escape from no_new_privs land. >=20 > Which is why I was suggesting that we don't allow changing any field = in > the cred except for uids and gids. Yes. Which is why in my original patch I pass a user_creds structure t= hat=20 only has the fsuid, fsgid, and altgroups. This means that the caller c= an=20 *only* impersonate these particular creds for purposes of file access, = not priv=20 escalation. As for magic fds, I didn't have but should add to the perms check a tes= t to=20 validate that the passed fd is one of our "magic" ones. The other us= ers of=20 anon fds do a similar test to verify that the fd is one that they gave = you.=20 This check would prevent the using of a random open of /dev/null with u= nknown=20 caps/creds because it would throw an error, preventing an escalation. = This fd=20 is also opened with O_CLOEXEC so children don't get it. Our use case = doesn't=20 need to fork/exec and if we needed to, we wouldn't be using switch_cred= s for=20 it anyway. Adding Close-on-exec removes the exploit opportunity. Jim >=20 > Eric --=20 Jim Lieb Linux Systems Engineer Panasas Inc. "If ease of use was the only requirement, we would all be riding tricyc= les" - Douglas Engelbart 1925=E2=80=932013 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel= " in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html