From mboxrd@z Thu Jan  1 00:00:00 1970
From: Stanislav Kinsbursky <skinsbursky@parallels.com>
Subject: Re: [PATCH review 52/85] sunrpc: Properly encode kuids and kgids
 in auth.unix.gid rpc pipe upcalls.
Date: Thu, 14 Feb 2013 11:12:16 +0400
Message-ID: <511C8E50.8080007@parallels.com>
References: <87621w14vs.fsf@xmission.com> <1360777934-5663-1-git-send-email-ebiederm@xmission.com> <1360777934-5663-52-git-send-email-ebiederm@xmission.com> <20130213210545.GO14195@fieldses.org> <874nhfrjgg.fsf@xmission.com> <20130213215047.GR14195@fieldses.org> <8738wzq1z6.fsf@xmission.com> <20130213225840.GV14195@fieldses.org> <87ip5vn6iv.fsf@xmission.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1251;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	<linux-fsdevel@vger.kernel.org>,
	Linux Containers <containers@lists.linux-foundation.org>,
	<linux-kernel@vger.kernel.org>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	"Trond Myklebust" <Trond.Myklebust@netapp.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <87ip5vn6iv.fsf@xmission.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

14.02.2013 03:22, Eric W. Biederman =EF=E8=F8=E5=F2:
> "J. Bruce Fields" <bfields@fieldses.org> writes:
>
>> On Wed, Feb 13, 2013 at 02:32:29PM -0800, Eric W. Biederman wrote:
>>> "J. Bruce Fields" <bfields@fieldses.org> writes:
>>>
>>>> On Wed, Feb 13, 2013 at 01:29:35PM -0800, Eric W. Biederman wrote:
>>>>> "J. Bruce Fields" <bfields@fieldses.org> writes:
>>>>>
>>>>>> On Wed, Feb 13, 2013 at 09:51:41AM -0800, Eric W. Biederman wrot=
e:
>>>>>>> From: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>>>
>>>>>>> When a new rpc connection is established with an in-kernel serv=
er, the
>>>>>>> traffic passes through svc_process_common, and svc_set_client a=
nd down
>>>>>>> into svcauth_unix_set_client if it is of type RPC_AUTH_NULL or
>>>>>>> RPC_AUTH_UNIX.
>>>>>>>
>>>>>>> svcauth_unix_set_client then looks at the uid of the credential=
 we
>>>>>>> have assigned to the incomming client and if we don't have the =
groups
>>>>>>> already cached makes an upcall to get a list of groups that the=
 client
>>>>>>> can use.
>>>>>>>
>>>>>>> The upcall encodes send a rpc message to user space encoding th=
e uid
>>>>>>> of the user whose groups we want to know.  Encode the kuid of t=
he user
>>>>>>> in the initial user namespace as nfs mounts can only happen tod=
ay in
>>>>>>> the initial user namespace.
>>>>>>
>>>>>> OK, I didn't know that.
>>>>>>
>>>>>> (Though I'm unclear how it should matter to the server what user
>>>>>> namespace the client is in?)
>>>>>
>>>>> Perhaps I have the description a little scrambled.  The short ver=
sion
>>>>> is that to start I only support the initial network namespace.
>>>>>
>>>>> If I haven't succeeded it is my intent to initially limit the ser=
vers
>>>>> to the initial user namespace as well.  I should see if I can fig=
ure
>>>>> that out.
>>>>>
>>>>>>> When a reply to an upcall comes in convert interpret the uid an=
d gid values
>>>>>>> from the rpc pipe as uids and gids in the initial user namespac=
e and convert
>>>>>>> them into kuids and kgids before processing them further.
>>>>>>>
>>>>>>> When reading proc files listing the uid to gid list cache conve=
rt the
>>>>>>> kuids and kgids from into uids and gids the initial user namesp=
ace.  As we are
>>>>>>> displaying server internal details it makes sense to display th=
ese values
>>>>>>> from the servers perspective.
>>>>>>
>>>>>> All of these caches are already per-network-namespace.  Ideally =
wouldn't
>>>>>> we also like to associate a user namespace with each cache someh=
ow?
>>>>>
>>>>> Ideally yes.  I read through the caches enough to figure out wher=
e there
>>>>> user space interfaces were, and to make certain we had conversion=
s
>>>>> to/from kuids and kgids.
>>>>>
>>>>> I haven't looked at what user namespace makes sense for these
>>>>> caches.  For this cache my first guess is that net->user_ns
>>>>> is what we want as it will be shared by all users in network name=
space I
>>>>> presume.
>>>>
>>>> Oh, I didn't know about net->user_ns--so each network namespace is
>>>> associated with a single user namespace, great, that simplifies li=
fe.
>>>> Yes, that sounds exactly right.
>>>
>>> Yes. net->user_ns is the user namespace the network namespace was
>>> created in.  And it is the user namespace that is used in test
>>> like ns_capable(net->user_ns, CAP_NET_ADMIN) to see if you are allo=
wed
>>> to manipulate the network namespace.  So looks like exactly what we
>>> want for that cache.
>>>
>>> Could you double check my understanding of the code?
>>>
>>> I want to be certain that I can't _yet_ start an sunrpc server proc=
ess
>>> outside of the initial user namespace.  While writing an earlier re=
ply I
>>> realized that I hadn't thought about where sunrpc server processes =
come
>>> from.
>>>
>>> Reading through the code it looks like we can have nfs mounts outsi=
de of
>>> the initial network namespace.
>>
>> We're talking about the server side here, not the client, so I'm not
>> sure what you mean by "nfs mounts".  The nfs server does use various
>> pseudofilesystems ("proc", "nfsd"), and those can be mounted outside=
 the
>> initial network namespace.
>
> Actually I was seeing that nfs clients were starting lockd.  So I was
> just reasoning here that anything that came from a nfs client was
> ultimately in the user namespace of that client, which is ultimately
> limited by the client out.
>
>> The server can receive rpc requests over network interfaces outside =
the
>> initial network namespace, sure.  The server doesn't perform mounts =
on
>> behalf of clients, though, it just accesses previously mounted
>> filesystems on clients' behalf.
>
> But nfsd_init_socks only creates sockets in a single network namespac=
e,
> and today we pass only &init_net.
>
>>> But because they are mounts they are
>>> still limited to the initial user namespace.
>>
>> OK, so that's just a limitation on any mount whatsoever for now.  I'=
m
>> catching on, slowly, thanks!
>
> If you set in struct filesystem .fs_flags =3D FS_USERNS_MOUNT your
> filesystem can be mounted outside of the initial user namespace.  But
> since that takes extra work and because unprivileged users are allowe=
d
> to create user namespaces and perform the mounts by default it is off=
=2E
>
>>> Now looking at the nfs server, seems to be hard coded to only start
>>> in the initial network namespace despite almost having support for
>>> starting in more.
>>
>> Right, Stanislav's got 4 more patches that should finish the job; se=
e
>> http://mid.gmane.org/<20130201125210.3257.46454.stgit@localhost.loca=
ldomain>
>> and followups.  That should make it for 3.9, I just need to review
>> them....
>
> Ok that is interesting.
>
> There is an interesting corner case here where an unprivileged user
> can create a user namespace and then can create a network namespace.
> Depending on how we interpret things when Stanislaves patches reach
> there we might have to add:
>
> if (net->user_ns !=3D &init_user_ns)
> 	-EINVAL
>
> Somewhere appropriate.
>
>>> Even more the nfs server is controlled and started through the "nfs=
d"
>>> filesystem.  Which has to be mounted before you can start the serve=
r.
>>> So you can only start the server through a mount in the initial use=
r
>>> namespace.
>>
>> Yes.
>>
>>> lockd is started by either the nfs server or the nfs client.
>>>
>>> There are no other sunrpc servers in the kernel.
>>
>> There are a couple callback services on the NFS client--those should=
 be
>> associated with nfs mounts in some obvious way.  There's a confusing=
 ACL
>> service that's really just an appendage of NFSv2/v3 service.
>>
>> I think we're fine.
>
> Thanks.
>
>>> I think all of that is enough to reasonably claim that you can't ha=
ve
>>> any sunrpc server processes outside of the initial user namespace. =
 But
>>> if I am wrong I would to find an appropriate spot to put in a line
>>> that says:
>>> 	 if (current_user_ns() !=3D &init_user)
>>>           	return -ESORRY_CHARLEY;
>>
>> I think you're right.
>>
>> So for now it's safely confined to one user namespace, and I think w=
e
>> understand approximately what to do if we want to support nfsd's in =
user
>> namespace in the future.  (Mainly, make sure nfsd and proc can be
>> mounted in them and then most things will be determined by the user_=
ns
>> of the network namespace associated with a given rpc.)
>
> For 3.9 the list of filesystems mountable outside the initial user
> namespace is: mqueuefs, tmpfs, ramfs, devpts, sysfs, and proc.
>
> I am a touch concerned about /proc/fs/nfsd/exports after my patches
> and Stanislavs patches both come in.  As I think that will allow for
> cases where net->user_ns !=3D &init_userns.  But we can cross that br=
idge
> when we come to it.
>


Hmmm...
Maybe I'm missing the point of user namespaces, but since NFS kernel se=
rver
is controlled via NFSd file system write calls, maybe it would be bette=
r to add:

=2Efs_flags =3D FS_USERNS_MOUNT

to it and add check:

+	if (net->user_ns !=3D current_user_ns())
+		return -EINVAL;

No?


> Eric
>


--=20
Best regards,
Stanislav Kinsbursky