[PATCH] RDMA/uverbs: Consider capability of the process that opens the file

linux-security-module.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
@ 2025-03-13  5:08 Parav Pandit
  2025-03-17 19:31 ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-03-13  5:08 UTC (permalink / raw)
  To: linux-rdma, linux-security-module
  Cc: ebiederm, serge, leonro, jgg, Parav Pandit

Currently, the capability check is done on the current process which
may have the CAP_NET_RAW capability, but such process may not have
opened the file. A file may could have been opened by a lesser
privilege process that does not possess the CAP_NET_RAW capability.

To avoid such situations, perform the capability checks against
the file's credentials. This approach ensures that the capabilities
of the process that opened the file are enforced.

Fixes: c938a616aadb ("IB/core: Add raw packet QP type")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>

---

Eric,

Shouldn't we check the capabilities of the process that opened the
file and also the current process that is issuing the create_flow()
ioctl? This way, the minimum capabilities of both processes are
considered.

---
 drivers/infiniband/core/uverbs_cmd.c  | 2 +-
 drivers/infiniband/core/uverbs_main.c | 2 +-
 include/rdma/uverbs_types.h           | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 96d639e1ffa0..e028454bcd7e 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3217,7 +3217,7 @@ static int ib_uverbs_ex_create_flow(struct uverbs_attr_bundle *attrs)
 	if (cmd.comp_mask)
 		return -EINVAL;
 
-	if (!capable(CAP_NET_RAW))
+	if (!file_ns_capable(attrs->ufile->filp, &init_user_ns, CAP_NET_RAW))
 		return -EPERM;
 
 	if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED)
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 973fe2c7ef53..8e5ee702e9f8 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -993,7 +993,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 
 	setup_ufile_idr_uobject(file);
-
+	file->filp = filp;
 	return stream_open(inode, filp);
 
 err_module:
diff --git a/include/rdma/uverbs_types.h b/include/rdma/uverbs_types.h
index 26ba919ac245..06f57d28d349 100644
--- a/include/rdma/uverbs_types.h
+++ b/include/rdma/uverbs_types.h
@@ -181,6 +181,7 @@ struct ib_uverbs_file {
 	struct xarray		idr;
 
 	struct mutex disassociation_lock;
+	struct file *filp;
 };
 
 extern const struct uverbs_obj_type_class uverbs_idr_class;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-13  5:08 [PATCH] RDMA/uverbs: Consider capability of the process that opens the file Parav Pandit
@ 2025-03-17 19:31 ` Jason Gunthorpe
  2025-03-18  3:43   ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-03-17 19:31 UTC (permalink / raw)
  To: Parav Pandit; +Cc: linux-rdma, linux-security-module, ebiederm, serge, leonro

On Thu, Mar 13, 2025 at 07:08:32AM +0200, Parav Pandit wrote:
> Currently, the capability check is done on the current process which
> may have the CAP_NET_RAW capability, but such process may not have
> opened the file. A file may could have been opened by a lesser
> privilege process that does not possess the CAP_NET_RAW capability.

> To avoid such situations, perform the capability checks against
> the file's credentials. This approach ensures that the capabilities
> of the process that opened the file are enforced.
> 
> Fixes: c938a616aadb ("IB/core: Add raw packet QP type")
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
> 
> ---
> 
> Eric,
> 
> Shouldn't we check the capabilities of the process that opened the
> file and also the current process that is issuing the create_flow()
> ioctl? This way, the minimum capabilities of both processes are
> considered.

I would say no, that is not our model in RDMA. The process that opens
the file is irrelevant. We only check the current system call context
for capability, much like any other systemcall.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-17 19:31 ` Jason Gunthorpe
@ 2025-03-18  3:43   ` Parav Pandit
  2025-03-18 11:20     ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-03-18  3:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	ebiederm@xmission.com, serge@hallyn.com, Leon Romanovsky



> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 18, 2025 1:02 AM
> 
> On Thu, Mar 13, 2025 at 07:08:32AM +0200, Parav Pandit wrote:
> > Currently, the capability check is done on the current process which
> > may have the CAP_NET_RAW capability, but such process may not have
> > opened the file. A file may could have been opened by a lesser
> > privilege process that does not possess the CAP_NET_RAW capability.
> 
> > To avoid such situations, perform the capability checks against the
> > file's credentials. This approach ensures that the capabilities of the
> > process that opened the file are enforced.
> >
> > Fixes: c938a616aadb ("IB/core: Add raw packet QP type")
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
> >
> > ---
> >
> > Eric,
> >
> > Shouldn't we check the capabilities of the process that opened the
> > file and also the current process that is issuing the create_flow()
> > ioctl? This way, the minimum capabilities of both processes are
> > considered.
> 
> I would say no, that is not our model in RDMA. The process that opens the file
> is irrelevant. We only check the current system call context for capability,
> much like any other systemcall.
> 
Eric explained the motivation [1] and [2] for this fix is:
A lesser privilege process A opens the fd (currently caps are not checked), passes the fd to a higher privilege process B.
And somehow let process B pass the needed capabilities check for resource creation, after which process A continue to use the resource without capability.

[1] https://lore.kernel.org/linux-rdma/87ecz4q27k.fsf@email.froward.int.ebiederm.org/
[2] https://lore.kernel.org/linux-rdma/87msdsoism.fsf@email.froward.int.ebiederm.org/


> Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-18  3:43   ` Parav Pandit
@ 2025-03-18 11:20     ` Jason Gunthorpe
  2025-03-18 12:30       ` Parav Pandit
  2025-03-18 20:00       ` Eric W. Biederman
  0 siblings, 2 replies; 58+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 11:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	ebiederm@xmission.com, serge@hallyn.com, Leon Romanovsky

On Tue, Mar 18, 2025 at 03:43:07AM +0000, Parav Pandit wrote:

> > I would say no, that is not our model in RDMA. The process that opens the file
> > is irrelevant. We only check the current system call context for capability,
> > much like any other systemcall.
> >
> Eric explained the motivation [1] and [2] for this fix is:
> A lesser privilege process A opens the fd (currently caps are not
> checked), passes the fd to a higher privilege process B.

> And somehow let process B pass the needed capabilities check for
> resource creation, after which process A continue to use the
> resource without capability.

Yes, I'd say that is fine within our model, and may even be desirable
in some cases.

We don't use a file descriptor linked security model, it is always
secured based on the individual ioctl system call. The file descriptor
is just a way to route the system calls.

The "setuid cat" risk is interesting, but we are supposed to be
preventing that by using ioctl, no 'cat' program is going to randomly
execute ioctls on stdout.

You would not say that if process B creates a CAP_NET_RAW socket FD
and passes it to process A without CAP_NET_RAW then A should not be
able to use the FD.

The same principle holds here too, the object handles scoped inside
the FD should have the same kind of security properties as a normal FD
would.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-18 11:20     ` Jason Gunthorpe
@ 2025-03-18 12:30       ` Parav Pandit
  2025-03-18 12:44         ` Jason Gunthorpe
  2025-03-18 20:00       ` Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-03-18 12:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	ebiederm@xmission.com, serge@hallyn.com, Leon Romanovsky



> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 18, 2025 4:51 PM
> 
> On Tue, Mar 18, 2025 at 03:43:07AM +0000, Parav Pandit wrote:
> 
> > > I would say no, that is not our model in RDMA. The process that
> > > opens the file is irrelevant. We only check the current system call
> > > context for capability, much like any other systemcall.
> > >
> > Eric explained the motivation [1] and [2] for this fix is:
> > A lesser privilege process A opens the fd (currently caps are not
> > checked), passes the fd to a higher privilege process B.
> 
> > And somehow let process B pass the needed capabilities check for
> > resource creation, after which process A continue to use the resource
> > without capability.
> 
> Yes, I'd say that is fine within our model, and may even be desirable in some
> cases.
>
Is this subsystem specific?
I was thinking it is generic enough to all configurations done through ioctl().
For example, I don't see any difference between [1] and rdma.

[1] https://github.com/torvalds/linux/blob/76b6905c11fd3c6dc4562aefc3e8c4429fefae1e/block/ioctl.c#L441
 
> We don't use a file descriptor linked security model, it is always secured based
> on the individual ioctl system call. The file descriptor is just a way to route the
> system calls.
If I understood right, Eric suggests to improve this model by file level additional checks.

> 
> The "setuid cat" risk is interesting, but we are supposed to be preventing that
> by using ioctl, no 'cat' program is going to randomly execute ioctls on stdout.
> 
> You would not say that if process B creates a CAP_NET_RAW socket FD and
> passes it to process A without CAP_NET_RAW then A should not be able to
> use the FD.
Well, process B is higher privilege which shared the socket FD.

This is what I was asking in this patch to Eric above, should we have the min() check of both the process?

Or may be sharing the fd is somewhat special case, and generically it should be check when sharing the fd itself?

Each has tradeoffs; would like to understand what is the current generic model that we can adopt in rdma subsystem too.

> 
> The same principle holds here too, the object handles scoped inside the FD
> should have the same kind of security properties as a normal FD would.
> 
> Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-18 12:30       ` Parav Pandit
@ 2025-03-18 12:44         ` Jason Gunthorpe
  0 siblings, 0 replies; 58+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 12:44 UTC (permalink / raw)
  To: Parav Pandit
  Cc: linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	ebiederm@xmission.com, serge@hallyn.com, Leon Romanovsky

On Tue, Mar 18, 2025 at 12:30:18PM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 18, 2025 4:51 PM
> > 
> > On Tue, Mar 18, 2025 at 03:43:07AM +0000, Parav Pandit wrote:
> > 
> > > > I would say no, that is not our model in RDMA. The process that
> > > > opens the file is irrelevant. We only check the current system call
> > > > context for capability, much like any other systemcall.
> > > >
> > > Eric explained the motivation [1] and [2] for this fix is:
> > > A lesser privilege process A opens the fd (currently caps are not
> > > checked), passes the fd to a higher privilege process B.
> > 
> > > And somehow let process B pass the needed capabilities check for
> > > resource creation, after which process A continue to use the resource
> > > without capability.
> > 
> > Yes, I'd say that is fine within our model, and may even be desirable in some
> > cases.
>
> Is this subsystem specific?

Probably.. How a FD works and it's security model is very specific to
each FD type.

> I was thinking it is generic enough to all configurations done through ioctl().
> For example, I don't see any difference between [1] and rdma.
> 
> [1] https://github.com/torvalds/linux/blob/76b6905c11fd3c6dc4562aefc3e8c4429fefae1e/block/ioctl.c#L441

Isn't that the same thing? roset is an ioctl on a block char dev file
descriptor. It doesn't check the capability of the process that opened
the file.

> > We don't use a file descriptor linked security model, it is always secured based
> > on the individual ioctl system call. The file descriptor is just a way to route the
> > system calls.
> If I understood right, Eric suggests to improve this model by file level additional checks.

Yes, but I'm not sure it is an improvement, or that it won't cause
regressions.

> > You would not say that if process B creates a CAP_NET_RAW socket FD and
> > passes it to process A without CAP_NET_RAW then A should not be able to
> > use the FD.
> Well, process B is higher privilege which shared the socket FD.

Yes, and?
 
> This is what I was asking in this patch to Eric above, should we have the min() check of both the process?

No. We don't do it above for sockets, we shouldn't do it for RDMA
objects either.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-18 11:20     ` Jason Gunthorpe
  2025-03-18 12:30       ` Parav Pandit
@ 2025-03-18 20:00       ` Eric W. Biederman
  2025-03-18 22:57         ` Jason Gunthorpe
  1 sibling, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2025-03-18 20:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, serge@hallyn.com,
	Leon Romanovsky

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Mar 18, 2025 at 03:43:07AM +0000, Parav Pandit wrote:
>
>> > I would say no, that is not our model in RDMA. The process that opens the file
>> > is irrelevant. We only check the current system call context for capability,
>> > much like any other systemcall.
>> >
>> Eric explained the motivation [1] and [2] for this fix is:
>> A lesser privilege process A opens the fd (currently caps are not
>> checked), passes the fd to a higher privilege process B.
>
>> And somehow let process B pass the needed capabilities check for
>> resource creation, after which process A continue to use the
>> resource without capability.
>
> Yes, I'd say that is fine within our model, and may even be desirable
> in some cases.
>
> We don't use a file descriptor linked security model, it is always
> secured based on the individual ioctl system call. The file descriptor
> is just a way to route the system calls.
>
> The "setuid cat" risk is interesting, but we are supposed to be
> preventing that by using ioctl, no 'cat' program is going to randomly
> execute ioctls on stdout.

I guess I see a few places where inifiniband uses ioctl.

There are also a lot of places where inifinband uses raw read/write on
file descriptors.  I think last time I looked infiniband wasn't even using
ioctl.

Now maybe using an ioctl is the best you can do at this point, because
of some backwards compatibility. 

> You would not say that if process B creates a CAP_NET_RAW socket FD
> and passes it to process A without CAP_NET_RAW then A should not be
> able to use the FD.

But that is exactly what the infiniband security check were are talking
about appears to be doing.  It is using the credentials of process A
and failing after it was passed by process B.

> The same principle holds here too, the object handles scoped inside
> the FD should have the same kind of security properties as a normal FD
> would.

Which is fine as far as I understand it is fine.  The creation check is
what we were talking about.

Taking from your example above.  If process B with CAP_NET_RAW creates a
FD for opening queue pairs and passes it to process A without
CAP_NET_RAW then A is not able to create queue pairs.

That is what the code in
drivers/infiniband/core/ubvers_cmd.c:create_qp() currenty says.

That is what has us confused.  Exactly the kind of thing you said should
not be happening is happening.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-18 20:00       ` Eric W. Biederman
@ 2025-03-18 22:57         ` Jason Gunthorpe
  2025-04-04 14:53           ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 22:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, serge@hallyn.com,
	Leon Romanovsky

On Tue, Mar 18, 2025 at 03:00:15PM -0500, Eric W. Biederman wrote:

> There are also a lot of places where inifinband uses raw read/write on
> file descriptors.  I think last time I looked infiniband wasn't even using
> ioctl.

Yeah, that's all deprecated now, and it had some major security issue
with the 'setuid cat' attack. IIRC it was mitigated by disallowing
read/write from a process with different credentials than the process
that opened the FD. This caused regressions which were resolved by
moving to ioctl.

Today you can compile the read/write interface out of the kernel - for
the last uh 6 years or so the userspace has exclusively used ioctl.

> > You would not say that if process B creates a CAP_NET_RAW socket FD
> > and passes it to process A without CAP_NET_RAW then A should not be
> > able to use the FD.
> 
> But that is exactly what the infiniband security check were are talking
> about appears to be doing.  It is using the credentials of process A
> and failing after it was passed by process B.

I'm not sure what you are refering too? The model should be that the
process invoking the system call is the one that provides the
capability set.

It is entirely possible that the code is wrong, but the above was the
intention.

> Taking from your example above.  If process B with CAP_NET_RAW creates a
> FD for opening queue pairs and passes it to process A without
> CAP_NET_RAW then A is not able to create queue pairs.

Yes that's right, because the FD itself has no security properties at
all, it is just a conduit for calling into the kernel.

Process A cannot create raw queue pairs in the same way that Process A
cannot create raw sockets, it doesn't matter what where the FD came
from.

> That is what the code in
> drivers/infiniband/core/ubvers_cmd.c:create_qp() currenty says.

I'm not sure what you are referring to here? That function is called
on the system call path, and at least the intention was that this:

        case IB_QPT_RAW_PACKET:
                if (!capable(CAP_NET_RAW))
                        return -EPERM;
                break;

Would check the current task invoking the system call to see if that
task has the required capability.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-03-18 22:57         ` Jason Gunthorpe
@ 2025-04-04 14:53           ` Parav Pandit
  2025-04-04 15:13             ` Jason Gunthorpe
  2025-04-21  3:13             ` Serge E. Hallyn
  0 siblings, 2 replies; 58+ messages in thread
From: Parav Pandit @ 2025-04-04 14:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Eric W. Biederman
  Cc: linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	serge@hallyn.com, Leon Romanovsky

Hi Eric, Jason,

I would like to resume and conclude this discussion.

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 19, 2025 4:27 AM

> On Tue, Mar 18, 2025 at 03:00:15PM -0500, Eric W. Biederman wrote:
> 
> > There are also a lot of places where inifinband uses raw read/write on
> > file descriptors.  I think last time I looked infiniband wasn't even
> > using ioctl.
> 
> Yeah, that's all deprecated now, and it had some major security issue with the
> 'setuid cat' attack. IIRC it was mitigated by disallowing read/write from a
> process with different credentials than the process that opened the FD. This
> caused regressions which were resolved by moving to ioctl.
> 
> Today you can compile the read/write interface out of the kernel - for the last
> uh 6 years or so the userspace has exclusively used ioctl.
> 
> > > You would not say that if process B creates a CAP_NET_RAW socket FD
> > > and passes it to process A without CAP_NET_RAW then A should not be
> > > able to use the FD.
> >
> > But that is exactly what the infiniband security check were are
> > talking about appears to be doing.  It is using the credentials of
> > process A and failing after it was passed by process B.
> 
> I'm not sure what you are refering too? The model should be that the process
> invoking the system call is the one that provides the capability set.
> 
> It is entirely possible that the code is wrong, but the above was the intention.
> 
> > Taking from your example above.  If process B with CAP_NET_RAW creates
> > a FD for opening queue pairs and passes it to process A without
> > CAP_NET_RAW then A is not able to create queue pairs.
> 
> Yes that's right, because the FD itself has no security properties at all, it is just
> a conduit for calling into the kernel.
> 
> Process A cannot create raw queue pairs in the same way that Process A
> cannot create raw sockets, it doesn't matter what where the FD came from.
> 
> > That is what the code in
> > drivers/infiniband/core/ubvers_cmd.c:create_qp() currenty says.
> 
> I'm not sure what you are referring to here? That function is called on the
> system call path, and at least the intention was that this:
> 
>         case IB_QPT_RAW_PACKET:
>                 if (!capable(CAP_NET_RAW))
>                         return -EPERM;
>                 break;
> 
> Would check the current task invoking the system call to see if that task has
> the required capability.
> 
> Jason

To summarize,

1. A process can open an RDMA resource (such as a raw QP, raw flow entry, or similar 'raw' resource)
through the fd using ioctl(), if it has the appropriate capability, which in this case is CAP_NET_RAW.
This is similar to a process that opens a raw socket.

2. Given that RDMA uses ioctl() for resource creation, there isn't a security concern surrounding
the read()/write() system calls.

3. If process A, which does not have CAP_NET_RAW, passes the opened fd to another privileged
process B, which has CAP_NET_RAW, process B can open the raw RDMA resource.
This is still within the kernel-defined security boundary, similar to a raw socket.

4. If process A, which has the CAP_NET_RAW capability, passes the file descriptor to Process B, which does not have CAP_NET_RAW, Process B will not be able to open the raw RDMA resource.

Do we agree on this Eric?

Assuming yes, to extend this, further,

5. the process's capability check should be done in the right user namespace.
(instead of current in default user ns).
The right user namespace is the one which created the net namespace.
This is because rdma networking resources are governed by the net namespace.

Above #5 aligns with the example from existing kernel doc snippet below [1] and few kernel examples of [2].

For example, suppose that a process attempts to change
       the hostname (sethostname(2)), a resource governed by the UTS
       namespace.  In this case, the kernel will determine which user
       namespace owns the process's UTS namespace, and check whether the
       process has the required capability (CAP_SYS_ADMIN) in that user
       namespace.

[1] https://man7.org/linux/man-pages/man7/user_namespaces.7.html

[2] examples snippet that follows above guidance of #5.

File: drivers/infiniband/core/device.c  
Function: ib_device_set_netns_put()
For net namespace:

         if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
                 ret = -EPERM;
                 goto ns_err;
         }
 
File: fs/namespace.c 
For mount namespace:
        if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
                goto out;
        if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
                goto out;
 
For uts ns:
 static int utsns_install(struct nsset *nsset, struct ns_common *new)
 {
         struct nsproxy *nsproxy = nsset->nsproxy;
         struct uts_namespace *ns = to_uts_ns(new);

         if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
             !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
                 return -EPERM;
 
For net ns:
File: net/core/dev_ioctl.c
         case SIOCSHWTSTAMP:
                 if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
                         return -EPERM;
                 fallthrough;
 
static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
{
         int ret;

         if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
                 return -EPERM;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-04 14:53           ` Parav Pandit
@ 2025-04-04 15:13             ` Jason Gunthorpe
  2025-04-06 14:15               ` Serge E. Hallyn
  2025-04-21  3:13             ` Serge E. Hallyn
  1 sibling, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-04 15:13 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, serge@hallyn.com,
	Leon Romanovsky

On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> To summarize,
> 
> 1. A process can open an RDMA resource (such as a raw QP, raw flow entry, or similar 'raw' resource)
> through the fd using ioctl(), if it has the appropriate capability, which in this case is CAP_NET_RAW.
> This is similar to a process that opens a raw socket.
> 
> 2. Given that RDMA uses ioctl() for resource creation, there isn't a security concern surrounding
> the read()/write() system calls.
> 
> 3. If process A, which does not have CAP_NET_RAW, passes the opened fd to another privileged
> process B, which has CAP_NET_RAW, process B can open the raw RDMA resource.
> This is still within the kernel-defined security boundary, similar to a raw socket.
> 
> 4. If process A, which has the CAP_NET_RAW capability, passes the file descriptor to Process B, which does not have CAP_NET_RAW, Process B will not be able to open the raw RDMA resource.
> 
> Do we agree on this Eric?

This is our model, I consider it uAPI, so I don't belive we can change
it without an extreme reason..

> 5. the process's capability check should be done in the right user namespace.
> (instead of current in default user ns).
> The right user namespace is the one which created the net namespace.
> This is because rdma networking resources are governed by the net namespace.

This all makes my head hurt. The right user namespace is the one that
is currently active for the invoking process, I couldn't understand
why we have net namespaces refer to user namespaces :\
 
Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-04 15:13             ` Jason Gunthorpe
@ 2025-04-06 14:15               ` Serge E. Hallyn
  2025-04-07 11:16                 ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-06 14:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, serge@hallyn.com,
	Leon Romanovsky

On Fri, Apr 04, 2025 at 12:13:47PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > To summarize,
> > 
> > 1. A process can open an RDMA resource (such as a raw QP, raw flow entry, or similar 'raw' resource)
> > through the fd using ioctl(), if it has the appropriate capability, which in this case is CAP_NET_RAW.
> > This is similar to a process that opens a raw socket.
> > 
> > 2. Given that RDMA uses ioctl() for resource creation, there isn't a security concern surrounding
> > the read()/write() system calls.
> > 
> > 3. If process A, which does not have CAP_NET_RAW, passes the opened fd to another privileged
> > process B, which has CAP_NET_RAW, process B can open the raw RDMA resource.
> > This is still within the kernel-defined security boundary, similar to a raw socket.
> > 
> > 4. If process A, which has the CAP_NET_RAW capability, passes the file descriptor to Process B, which does not have CAP_NET_RAW, Process B will not be able to open the raw RDMA resource.
> > 
> > Do we agree on this Eric?
> 
> This is our model, I consider it uAPI, so I don't belive we can change
> it without an extreme reason..
> 
> > 5. the process's capability check should be done in the right user namespace.
> > (instead of current in default user ns).
> > The right user namespace is the one which created the net namespace.
> > This is because rdma networking resources are governed by the net namespace.
> 
> This all makes my head hurt. The right user namespace is the one that
> is currently active for the invoking process, I couldn't understand
> why we have net namespaces refer to user namespaces :\

A user at any time can create a new user namespace, without creating a
new network namespace, and have privilege in that user namespace, over
resources owned by the user namespace.

So if a user can create a new user namespace, then say "hey I have
CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
resources belonging to my current_net_ns", that's a problem.

So that's why the check should be ns_capable(device->net->user-ns, CAP_NET_ADMIN)
and not ns_capable(current_user_ns, CAP_NET_ADMIN).

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-06 14:15               ` Serge E. Hallyn
@ 2025-04-07 11:16                 ` Parav Pandit
  2025-04-07 14:46                   ` sergeh
  2025-04-07 16:12                   ` Jason Gunthorpe
  0 siblings, 2 replies; 58+ messages in thread
From: Parav Pandit @ 2025-04-07 11:16 UTC (permalink / raw)
  To: Serge E. Hallyn, Jason Gunthorpe
  Cc: Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Sunday, April 6, 2025 7:45 PM
> 
> On Fri, Apr 04, 2025 at 12:13:47PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > To summarize,
> > >
> > > 1. A process can open an RDMA resource (such as a raw QP, raw flow
> > > entry, or similar 'raw' resource) through the fd using ioctl(), if it has the
> appropriate capability, which in this case is CAP_NET_RAW.
> > > This is similar to a process that opens a raw socket.
> > >
> > > 2. Given that RDMA uses ioctl() for resource creation, there isn't a
> > > security concern surrounding the read()/write() system calls.
> > >
> > > 3. If process A, which does not have CAP_NET_RAW, passes the opened
> > > fd to another privileged process B, which has CAP_NET_RAW, process B
> can open the raw RDMA resource.
> > > This is still within the kernel-defined security boundary, similar to a raw
> socket.
> > >
> > > 4. If process A, which has the CAP_NET_RAW capability, passes the file
> descriptor to Process B, which does not have CAP_NET_RAW, Process B will
> not be able to open the raw RDMA resource.
> > >
> > > Do we agree on this Eric?
> >
> > This is our model, I consider it uAPI, so I don't belive we can change
> > it without an extreme reason..
> >
> > > 5. the process's capability check should be done in the right user
> namespace.
> > > (instead of current in default user ns).
> > > The right user namespace is the one which created the net namespace.
> > > This is because rdma networking resources are governed by the net
> namespace.
> >
> > This all makes my head hurt. The right user namespace is the one that
> > is currently active for the invoking process, I couldn't understand
> > why we have net namespaces refer to user namespaces :\
> 
> A user at any time can create a new user namespace, without creating a new
> network namespace, and have privilege in that user namespace, over
> resources owned by the user namespace.
>
 
> So if a user can create a new user namespace, then say "hey I have
> CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
> resources belonging to my current_net_ns", that's a problem.
> 
> So that's why the check should be ns_capable(device->net->user-ns,
> CAP_NET_ADMIN) and not ns_capable(current_user_ns, CAP_NET_ADMIN).
>
Given the check is of the process (and hence user and net ns) and not of the rdma device itself,
Shouldn't we just check,

ns_capable(current->nsproxy->user_ns, ...)

This ensures current network namespace's owning user ns is consulted.
 
> -serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-07 11:16                 ` Parav Pandit
@ 2025-04-07 14:46                   ` sergeh
  2025-04-20 12:30                     ` Parav Pandit
  2025-04-07 16:12                   ` Jason Gunthorpe
  1 sibling, 1 reply; 58+ messages in thread
From: sergeh @ 2025-04-07 14:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Jason Gunthorpe, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Mon, Apr 07, 2025 at 11:16:35AM +0000, Parav Pandit wrote:
> > From: Serge E. Hallyn <serge@hallyn.com>
> > Sent: Sunday, April 6, 2025 7:45 PM
> > 
> > On Fri, Apr 04, 2025 at 12:13:47PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > > To summarize,
> > > >
> > > > 1. A process can open an RDMA resource (such as a raw QP, raw flow
> > > > entry, or similar 'raw' resource) through the fd using ioctl(), if it has the
> > appropriate capability, which in this case is CAP_NET_RAW.
> > > > This is similar to a process that opens a raw socket.
> > > >
> > > > 2. Given that RDMA uses ioctl() for resource creation, there isn't a
> > > > security concern surrounding the read()/write() system calls.
> > > >
> > > > 3. If process A, which does not have CAP_NET_RAW, passes the opened
> > > > fd to another privileged process B, which has CAP_NET_RAW, process B
> > can open the raw RDMA resource.
> > > > This is still within the kernel-defined security boundary, similar to a raw
> > socket.
> > > >
> > > > 4. If process A, which has the CAP_NET_RAW capability, passes the file
> > descriptor to Process B, which does not have CAP_NET_RAW, Process B will
> > not be able to open the raw RDMA resource.
> > > >
> > > > Do we agree on this Eric?
> > >
> > > This is our model, I consider it uAPI, so I don't belive we can change
> > > it without an extreme reason..
> > >
> > > > 5. the process's capability check should be done in the right user
> > namespace.
> > > > (instead of current in default user ns).
> > > > The right user namespace is the one which created the net namespace.
> > > > This is because rdma networking resources are governed by the net
> > namespace.
> > >
> > > This all makes my head hurt. The right user namespace is the one that
> > > is currently active for the invoking process, I couldn't understand
> > > why we have net namespaces refer to user namespaces :\
> > 
> > A user at any time can create a new user namespace, without creating a new
> > network namespace, and have privilege in that user namespace, over
> > resources owned by the user namespace.
> >
>  
> > So if a user can create a new user namespace, then say "hey I have
> > CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
> > resources belonging to my current_net_ns", that's a problem.
> > 
> > So that's why the check should be ns_capable(device->net->user-ns,
> > CAP_NET_ADMIN) and not ns_capable(current_user_ns, CAP_NET_ADMIN).
> >
> Given the check is of the process (and hence user and net ns) and not of the rdma device itself,
> Shouldn't we just check,
> 
> ns_capable(current->nsproxy->user_ns, ...)
> 
> This ensures current network namespace's owning user ns is consulted.

No, it does not.  If I do

unshare -U

then current->nsproxy->user_ns is not my current network namespace's
owning user ns.

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-07 11:16                 ` Parav Pandit
  2025-04-07 14:46                   ` sergeh
@ 2025-04-07 16:12                   ` Jason Gunthorpe
  2025-04-08 14:44                     ` Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-07 16:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Mon, Apr 07, 2025 at 11:16:35AM +0000, Parav Pandit wrote:
> > > This all makes my head hurt. The right user namespace is the one that
> > > is currently active for the invoking process, I couldn't understand
> > > why we have net namespaces refer to user namespaces :\
> > 
> > A user at any time can create a new user namespace, without creating a new
> > network namespace, and have privilege in that user namespace, over
> > resources owned by the user namespace.
>  
> > So if a user can create a new user namespace, then say "hey I have
> > CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
> > resources belonging to my current_net_ns", that's a problem.

But why is that possible? If the current user name space does not have
CAP_NET_ADMIN then why can it create a new user name space that does?

And if userspace does have CAP_NET_ADMIN what is the issue with
creating more user namespaces that also have it?

> > So that's why the check should be ns_capable(device->net->user-ns,
> > CAP_NET_ADMIN) and not ns_capable(current_user_ns, CAP_NET_ADMIN).
> >
> Given the check is of the process (and hence user and net ns) and not of the rdma device itself,
> Shouldn't we just check,
> 
> ns_capable(current->nsproxy->user_ns, ...)
> 
> This ensures current network namespace's owning user ns is consulted.

It sounds like the design does not store the capabilities inside the
current user_ns, but it logically stores them in other NSs. Ie all the
net related capabilities are in the netns.

Presumably then we have a mapping of every capability to the proper
namespace to store it?

If the container has a user namespace and the net ns uses the same
user namespace then you get the appearance of user namespace
controlled capabilities...

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-07 16:12                   ` Jason Gunthorpe
@ 2025-04-08 14:44                     ` Eric W. Biederman
  0 siblings, 0 replies; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-08 14:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Mon, Apr 07, 2025 at 11:16:35AM +0000, Parav Pandit wrote:
>> > > This all makes my head hurt. The right user namespace is the one that
>> > > is currently active for the invoking process, I couldn't understand
>> > > why we have net namespaces refer to user namespaces :\
>> > 
>> > A user at any time can create a new user namespace, without creating a new
>> > network namespace, and have privilege in that user namespace, over
>> > resources owned by the user namespace.
>>  
>> > So if a user can create a new user namespace, then say "hey I have
>> > CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
>> > resources belonging to my current_net_ns", that's a problem.
>
> But why is that possible? If the current user name space does not have
> CAP_NET_ADMIN then why can it create a new user name space that does?

Because it isn't CAP_NET_ADMIN.  The capabilities are per user
namespace.

AKA the pair (&init_user_ns, CAP_NET_ADMIN) is what you think of when
you think of CAP_NET_ADMIN.

The reason for this is a lot of things that capabilities guard are only
semantically a problem because it would confuse preexisting suid root
binaries.  Binding to the low ports (for example) is no more special
than binding to any other port, except that assumptions can be made
about who has bound to the low ports.

So if you can restrict binding to the low ports only to network
namespaces that you and your children control so there is no change
of confusing a suid root application that it is a legitimate operation
to perform.

In networking terms the user namespace and the subordinate namespace
created with user namespace permissions are a bit like a tunnel.  The
users of a tunnel can do anything inside their tunnel assign IP
addresses etc, and no one will care as long as it all stays inside the
tunnel.

So in essence the question is do you have capabilities within the tunnel
or do you have capabilities outside of a tunnel.

Do to historical silliness there is a practical concern about code that
only root could run.  People tend not to worry if there are bugs that
allow such code to do unintended things.  So even if semantically it is
safe to allow such code, generally the code needs a bit of an audit to
make certain there are not bugs or implementation assumptions that will
be violated when allowing additional functionality in a user namespace.

> And if userspace does have CAP_NET_ADMIN what is the issue with
> creating more user namespaces that also have it?
>
>> > So that's why the check should be ns_capable(device->net->user-ns,
>> > CAP_NET_ADMIN) and not ns_capable(current_user_ns, CAP_NET_ADMIN).
>> >
>> Given the check is of the process (and hence user and net ns) and not of the rdma device itself,
>> Shouldn't we just check,
>> 
>> ns_capable(current->nsproxy->user_ns, ...)
>> 
>> This ensures current network namespace's owning user ns is consulted.
>
> It sounds like the design does not store the capabilities inside the
> current user_ns, but it logically stores them in other NSs. Ie all the
> net related capabilities are in the netns.
>
> Presumably then we have a mapping of every capability to the proper
> namespace to store it?

Store is the wrong concept.  Namespaces remember which user namespace
they were created from.  This allows the capability checks to require
that you have the capability in the user namespace that created them,
or in a parent user namespace.

There exists a full set of capabilities that can be present in
a user namespace.  The initial process in a user namespace is given
all of those capabilities in it's struct cred.  Just like the init
process is given all capabilities at system start.  The difference
is that when all you have are capabilities that are limited to
a user namespace they don't allow anything to be done (other than
creating namespaces) unless some namespaces are created from that
user namespace.

> If the container has a user namespace and the net ns uses the same
> user namespace then you get the appearance of user namespace
> controlled capabilities...

Essentially yes.

That network namespace requires CAP_NET_ADMIN in the user namespace
it was created within (or a parent user namespace), for it's capability
checks.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-07 14:46                   ` sergeh
@ 2025-04-20 12:30                     ` Parav Pandit
  2025-04-20 13:41                       ` Serge E. Hallyn
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-20 12:30 UTC (permalink / raw)
  To: sergeh@kernel.org
  Cc: Serge E. Hallyn, Jason Gunthorpe, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky



> From: sergeh@kernel.org <sergeh@kernel.org>
> Sent: Monday, April 7, 2025 8:17 PM
> 
> On Mon, Apr 07, 2025 at 11:16:35AM +0000, Parav Pandit wrote:
> > > From: Serge E. Hallyn <serge@hallyn.com>
> > > Sent: Sunday, April 6, 2025 7:45 PM
> > >
> > > On Fri, Apr 04, 2025 at 12:13:47PM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > > > To summarize,
> > > > >
> > > > > 1. A process can open an RDMA resource (such as a raw QP, raw
> > > > > flow entry, or similar 'raw' resource) through the fd using
> > > > > ioctl(), if it has the
> > > appropriate capability, which in this case is CAP_NET_RAW.
> > > > > This is similar to a process that opens a raw socket.
> > > > >
> > > > > 2. Given that RDMA uses ioctl() for resource creation, there
> > > > > isn't a security concern surrounding the read()/write() system calls.
> > > > >
> > > > > 3. If process A, which does not have CAP_NET_RAW, passes the
> > > > > opened fd to another privileged process B, which has
> > > > > CAP_NET_RAW, process B
> > > can open the raw RDMA resource.
> > > > > This is still within the kernel-defined security boundary,
> > > > > similar to a raw
> > > socket.
> > > > >
> > > > > 4. If process A, which has the CAP_NET_RAW capability, passes
> > > > > the file
> > > descriptor to Process B, which does not have CAP_NET_RAW, Process B
> > > will not be able to open the raw RDMA resource.
> > > > >
> > > > > Do we agree on this Eric?
> > > >
> > > > This is our model, I consider it uAPI, so I don't belive we can
> > > > change it without an extreme reason..
> > > >
> > > > > 5. the process's capability check should be done in the right
> > > > > user
> > > namespace.
> > > > > (instead of current in default user ns).
> > > > > The right user namespace is the one which created the net namespace.
> > > > > This is because rdma networking resources are governed by the
> > > > > net
> > > namespace.
> > > >
> > > > This all makes my head hurt. The right user namespace is the one
> > > > that is currently active for the invoking process, I couldn't
> > > > understand why we have net namespaces refer to user namespaces :\
> > >
> > > A user at any time can create a new user namespace, without creating
> > > a new network namespace, and have privilege in that user namespace,
> > > over resources owned by the user namespace.
> > >
> >
> > > So if a user can create a new user namespace, then say "hey I have
> > > CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
> > > resources belonging to my current_net_ns", that's a problem.
> > >
> > > So that's why the check should be ns_capable(device->net->user-ns,
> > > CAP_NET_ADMIN) and not ns_capable(current_user_ns,
> CAP_NET_ADMIN).
> > >
> > Given the check is of the process (and hence user and net ns) and not
> > of the rdma device itself, Shouldn't we just check,
> >
> > ns_capable(current->nsproxy->user_ns, ...)
> >
> > This ensures current network namespace's owning user ns is consulted.
> 
> No, it does not.  If I do
> 
> unshare -U
> 
> then current->nsproxy->user_ns is not my current network namespace's
> owning user ns.
>
It should be current->nsproxy->net->user_ns.
This ensures that it is always current network namespace's owning user ns is considered.
Right?

Sorry for the late response.
I wasn't well for few days followed by backlog.
 
> -serge


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-20 12:30                     ` Parav Pandit
@ 2025-04-20 13:41                       ` Serge E. Hallyn
  2025-04-20 17:31                         ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-20 13:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: sergeh@kernel.org, Serge E. Hallyn, Jason Gunthorpe,
	Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Sun, Apr 20, 2025 at 12:30:37PM +0000, Parav Pandit wrote:
> 
> 
> > From: sergeh@kernel.org <sergeh@kernel.org>
> > Sent: Monday, April 7, 2025 8:17 PM
> > 
> > On Mon, Apr 07, 2025 at 11:16:35AM +0000, Parav Pandit wrote:
> > > > From: Serge E. Hallyn <serge@hallyn.com>
> > > > Sent: Sunday, April 6, 2025 7:45 PM
> > > >
> > > > On Fri, Apr 04, 2025 at 12:13:47PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > > > > To summarize,
> > > > > >
> > > > > > 1. A process can open an RDMA resource (such as a raw QP, raw
> > > > > > flow entry, or similar 'raw' resource) through the fd using
> > > > > > ioctl(), if it has the
> > > > appropriate capability, which in this case is CAP_NET_RAW.
> > > > > > This is similar to a process that opens a raw socket.
> > > > > >
> > > > > > 2. Given that RDMA uses ioctl() for resource creation, there
> > > > > > isn't a security concern surrounding the read()/write() system calls.
> > > > > >
> > > > > > 3. If process A, which does not have CAP_NET_RAW, passes the
> > > > > > opened fd to another privileged process B, which has
> > > > > > CAP_NET_RAW, process B
> > > > can open the raw RDMA resource.
> > > > > > This is still within the kernel-defined security boundary,
> > > > > > similar to a raw
> > > > socket.
> > > > > >
> > > > > > 4. If process A, which has the CAP_NET_RAW capability, passes
> > > > > > the file
> > > > descriptor to Process B, which does not have CAP_NET_RAW, Process B
> > > > will not be able to open the raw RDMA resource.
> > > > > >
> > > > > > Do we agree on this Eric?
> > > > >
> > > > > This is our model, I consider it uAPI, so I don't belive we can
> > > > > change it without an extreme reason..
> > > > >
> > > > > > 5. the process's capability check should be done in the right
> > > > > > user
> > > > namespace.
> > > > > > (instead of current in default user ns).
> > > > > > The right user namespace is the one which created the net namespace.
> > > > > > This is because rdma networking resources are governed by the
> > > > > > net
> > > > namespace.
> > > > >
> > > > > This all makes my head hurt. The right user namespace is the one
> > > > > that is currently active for the invoking process, I couldn't
> > > > > understand why we have net namespaces refer to user namespaces :\
> > > >
> > > > A user at any time can create a new user namespace, without creating
> > > > a new network namespace, and have privilege in that user namespace,
> > > > over resources owned by the user namespace.
> > > >
> > >
> > > > So if a user can create a new user namespace, then say "hey I have
> > > > CAP_NET_ADMIN over current_user_ns, so give me access to the RDMA
> > > > resources belonging to my current_net_ns", that's a problem.
> > > >
> > > > So that's why the check should be ns_capable(device->net->user-ns,
> > > > CAP_NET_ADMIN) and not ns_capable(current_user_ns,
> > CAP_NET_ADMIN).
> > > >
> > > Given the check is of the process (and hence user and net ns) and not
> > > of the rdma device itself, Shouldn't we just check,
> > >
> > > ns_capable(current->nsproxy->user_ns, ...)
> > >
> > > This ensures current network namespace's owning user ns is consulted.
> > 
> > No, it does not.  If I do
> > 
> > unshare -U
> > 
> > then current->nsproxy->user_ns is not my current network namespace's
> > owning user ns.
> >
> It should be current->nsproxy->net->user_ns.
> This ensures that it is always current network namespace's owning user ns is considered.
> Right?
> 
> Sorry for the late response.
> I wasn't well for few days followed by backlog.

Hi,

That will depend on exactly what you're checking permissions for.
It looks like ib_uverbs_ex_create_flow() gets passed a
uverbs_attr_bundle pointer that has a context which holds the
thing you're actually checking permissions towards?  And I'm
assuming that that thing is actually a file?  So again, if the
task can create the "thing" first, then unshare its network
namespace, then cause this permission to be checked, or if
it can accept a file over unix socket or whatever that someone
else opened, then current->nsproxy->net->user_ns may *not* be
relevant.  If, however, the flow, later on, will ensure that
any actions are only relevant in the current network namespace,
then you are correct.

I just can't tell in this flow.  I"ll try to find some time to
track it down more.

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-20 13:41                       ` Serge E. Hallyn
@ 2025-04-20 17:31                         ` Parav Pandit
  0 siblings, 0 replies; 58+ messages in thread
From: Parav Pandit @ 2025-04-20 17:31 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: sergeh@kernel.org, Jason Gunthorpe, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky



> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Sunday, April 20, 2025 7:12 PM
> 
> On Sun, Apr 20, 2025 at 12:30:37PM +0000, Parav Pandit wrote:
> >
> >
> > > From: sergeh@kernel.org <sergeh@kernel.org>
> > > Sent: Monday, April 7, 2025 8:17 PM
> > >
> > > On Mon, Apr 07, 2025 at 11:16:35AM +0000, Parav Pandit wrote:
> > > > > From: Serge E. Hallyn <serge@hallyn.com>
> > > > > Sent: Sunday, April 6, 2025 7:45 PM
> > > > >
> > > > > On Fri, Apr 04, 2025 at 12:13:47PM -0300, Jason Gunthorpe wrote:
> > > > > > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > > > > > To summarize,
> > > > > > >
> > > > > > > 1. A process can open an RDMA resource (such as a raw QP,
> > > > > > > raw flow entry, or similar 'raw' resource) through the fd
> > > > > > > using ioctl(), if it has the
> > > > > appropriate capability, which in this case is CAP_NET_RAW.
> > > > > > > This is similar to a process that opens a raw socket.
> > > > > > >
> > > > > > > 2. Given that RDMA uses ioctl() for resource creation, there
> > > > > > > isn't a security concern surrounding the read()/write() system calls.
> > > > > > >
> > > > > > > 3. If process A, which does not have CAP_NET_RAW, passes the
> > > > > > > opened fd to another privileged process B, which has
> > > > > > > CAP_NET_RAW, process B
> > > > > can open the raw RDMA resource.
> > > > > > > This is still within the kernel-defined security boundary,
> > > > > > > similar to a raw
> > > > > socket.
> > > > > > >
> > > > > > > 4. If process A, which has the CAP_NET_RAW capability,
> > > > > > > passes the file
> > > > > descriptor to Process B, which does not have CAP_NET_RAW,
> > > > > Process B will not be able to open the raw RDMA resource.
> > > > > > >
> > > > > > > Do we agree on this Eric?
> > > > > >
> > > > > > This is our model, I consider it uAPI, so I don't belive we
> > > > > > can change it without an extreme reason..
> > > > > >
> > > > > > > 5. the process's capability check should be done in the
> > > > > > > right user
> > > > > namespace.
> > > > > > > (instead of current in default user ns).
> > > > > > > The right user namespace is the one which created the net
> namespace.
> > > > > > > This is because rdma networking resources are governed by
> > > > > > > the net
> > > > > namespace.
> > > > > >
> > > > > > This all makes my head hurt. The right user namespace is the
> > > > > > one that is currently active for the invoking process, I
> > > > > > couldn't understand why we have net namespaces refer to user
> > > > > > namespaces :\
> > > > >
> > > > > A user at any time can create a new user namespace, without
> > > > > creating a new network namespace, and have privilege in that
> > > > > user namespace, over resources owned by the user namespace.
> > > > >
> > > >
> > > > > So if a user can create a new user namespace, then say "hey I
> > > > > have CAP_NET_ADMIN over current_user_ns, so give me access to
> > > > > the RDMA resources belonging to my current_net_ns", that's a
> problem.
> > > > >
> > > > > So that's why the check should be
> > > > > ns_capable(device->net->user-ns,
> > > > > CAP_NET_ADMIN) and not ns_capable(current_user_ns,
> > > CAP_NET_ADMIN).
> > > > >
> > > > Given the check is of the process (and hence user and net ns) and
> > > > not of the rdma device itself, Shouldn't we just check,
> > > >
> > > > ns_capable(current->nsproxy->user_ns, ...)
> > > >
> > > > This ensures current network namespace's owning user ns is consulted.
> > >
> > > No, it does not.  If I do
> > >
> > > unshare -U
> > >
> > > then current->nsproxy->user_ns is not my current network namespace's
> > > owning user ns.
> > >
> > It should be current->nsproxy->net->user_ns.
> > This ensures that it is always current network namespace's owning user ns is
> considered.
> > Right?
> >
> 
> Hi,
> 
> That will depend on exactly what you're checking permissions for.
> It looks like ib_uverbs_ex_create_flow() gets passed a uverbs_attr_bundle
> pointer that has a context which holds the thing you're actually checking
> permissions towards?  And I'm assuming that that thing is actually a file?  

File is just a conduit calling into the kernel for a specific device.
And ucontext is first object where other objects are anchored such as flow done using ex_create_flow().
So what we really want to check on the ioctl() is, if the calling process has necessary capability in its user namespace.

> So
> again, if the task can create the "thing" first, then unshare its network
> namespace, then cause this permission to be checked, or if it can accept a file
> over unix socket or whatever that someone else opened, then current-
> >nsproxy->net->user_ns may *not* be relevant.  
As Jason explained, it can receive or it can unshare too.
As long as the process invoking ioctl() has the capability, it should be able to do it.
Jason explained the rdma fd  model in [1] 

[1] https://lore.kernel.org/linux-rdma/20250420134144.GA575032@mail.hallyn.com/T/#m1c84babbc593b255c6a4fdebb6c65651717a75f7

> If, however, the flow, later
> on, will ensure that any actions are only relevant in the current network
> namespace, then you are correct.
> 
Indeed. Flow and its operation are in the net namespace of the process.

> I just can't tell in this flow.  I"ll try to find some time to track it down more.
> 
> -serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-04 14:53           ` Parav Pandit
  2025-04-04 15:13             ` Jason Gunthorpe
@ 2025-04-21  3:13             ` Serge E. Hallyn
  2025-04-21 11:04               ` Parav Pandit
  1 sibling, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-21  3:13 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, serge@hallyn.com,
	Leon Romanovsky

On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> Hi Eric, Jason,

Hi,

I'm jumping back up the thread as I think this email best details the
things I'm confused about :)  Three questions below in two different
stanzas.

> To summarize,
> 
> 1. A process can open an RDMA resource (such as a raw QP, raw flow entry, or similar 'raw' resource)
> through the fd using ioctl(), if it has the appropriate capability, which in this case is CAP_NET_RAW.

Why does it need CAP_NET_RAW to create the resource, if the resource won't
be usable by a process without CAP_NET_RAW later anyway?  Is that legacy
for the read/write (vs ioctl) case?  Or is it to limit the number of
opened resources?  Or some other reason?

Is the resource which is created tied to the net namespce of the process
which created it?

> This is similar to a process that opens a raw socket.
> 
> 2. Given that RDMA uses ioctl() for resource creation, there isn't a security concern surrounding
> the read()/write() system calls.
> 
> 3. If process A, which does not have CAP_NET_RAW, passes the opened fd to another privileged
> process B, which has CAP_NET_RAW, process B can open the raw RDMA resource.
> This is still within the kernel-defined security boundary, similar to a raw socket.
> 
> 4. If process A, which has the CAP_NET_RAW capability, passes the file descriptor to Process B, which does not have CAP_NET_RAW, Process B will not be able to open the raw RDMA resource.
> 
> Do we agree on this Eric?
> 
> Assuming yes, to extend this, further,
> 
> 5. the process's capability check should be done in the right user namespace.
> (instead of current in default user ns).
> The right user namespace is the one which created the net namespace.

"the one which created THE net namespace" - which net namespace?   The
one in which the process which created the resource belonged, or the
one in which the current process (calling ioctl) belongs?

> This is because rdma networking resources are governed by the net namespace.
> 
> Above #5 aligns with the example from existing kernel doc snippet below [1] and few kernel examples of [2].
> 
> For example, suppose that a process attempts to change
>        the hostname (sethostname(2)), a resource governed by the UTS
>        namespace.  In this case, the kernel will determine which user
>        namespace owns the process's UTS namespace, and check whether the
>        process has the required capability (CAP_SYS_ADMIN) in that user
>        namespace.
> 
> [1] https://man7.org/linux/man-pages/man7/user_namespaces.7.html
> 
> [2] examples snippet that follows above guidance of #5.
> 
> File: drivers/infiniband/core/device.c  
> Function: ib_device_set_netns_put()
> For net namespace:
> 
>          if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
>                  ret = -EPERM;
>                  goto ns_err;
>          }
>  
> File: fs/namespace.c 
> For mount namespace:
>         if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
>                 goto out;
>         if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
>                 goto out;
>  
> For uts ns:
>  static int utsns_install(struct nsset *nsset, struct ns_common *new)
>  {
>          struct nsproxy *nsproxy = nsset->nsproxy;
>          struct uts_namespace *ns = to_uts_ns(new);
> 
>          if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
>              !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
>                  return -EPERM;
>  
> For net ns:
> File: net/core/dev_ioctl.c
>          case SIOCSHWTSTAMP:
>                  if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
>                          return -EPERM;
>                  fallthrough;
>  
> static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
> {
>          int ret;
> 
>          if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
>                  return -EPERM;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-21  3:13             ` Serge E. Hallyn
@ 2025-04-21 11:04               ` Parav Pandit
  2025-04-21 13:00                 ` Serge E. Hallyn
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-21 11:04 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Jason Gunthorpe, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Monday, April 21, 2025 8:43 AM
> 
> On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > Hi Eric, Jason,
> 
> Hi,
> 
> I'm jumping back up the thread as I think this email best details the things I'm
> confused about :)  Three questions below in two different stanzas.
> 
> > To summarize,
> >
> > 1. A process can open an RDMA resource (such as a raw QP, raw flow
> > entry, or similar 'raw' resource) through the fd using ioctl(), if it has the
> appropriate capability, which in this case is CAP_NET_RAW.
> 
> Why does it need CAP_NET_RAW to create the resource, if the resource won't
> be usable by a process without CAP_NET_RAW later anyway?  
Once the resource is created, and the fd is shared (like a raw socket fd), it will be usable by a process without CAP_NET_RAW.
Is that a concern? If yes, how is it solved for raw socket fd? It appears to me that it is not.

> Is that legacy
> for the read/write (vs ioctl) case?  
No.

> Or is it to limit the number of opened
> resources?  Or some other reason?
> 
The resource enables to do raw operation, hence the capability check of the process for having NET_RAW cap.

> Is the resource which is created tied to the net namespce of the process
> which created it?
The resource is tied to the process. So if rdma device on which the resource is created, if rdma device net ns changes, then this resource will be destroyed.
The resource is associated with the process. Therefore, if the RDMA device on which the resource was created changes its network namespace, all the resources will be destroyed for the process.

> 
> > This is similar to a process that opens a raw socket.
> >
> > 2. Given that RDMA uses ioctl() for resource creation, there isn't a
> > security concern surrounding the read()/write() system calls.
> >
> > 3. If process A, which does not have CAP_NET_RAW, passes the opened fd
> > to another privileged process B, which has CAP_NET_RAW, process B can
> open the raw RDMA resource.
> > This is still within the kernel-defined security boundary, similar to a raw
> socket.
> >
> > 4. If process A, which has the CAP_NET_RAW capability, passes the file
> descriptor to Process B, which does not have CAP_NET_RAW, Process B will
> not be able to open the raw RDMA resource.
> >
> > Do we agree on this Eric?
> >
> > Assuming yes, to extend this, further,
> >
> > 5. the process's capability check should be done in the right user
> namespace.
> > (instead of current in default user ns).
> > The right user namespace is the one which created the net namespace.
> 
> "the one which created THE net namespace" - which net namespace?   The
> one in which the process which created the resource belonged, or the one in
> which the current process (calling ioctl) belongs?
When the ioctl() is invoked for resource creation, this process has its net namespace.
And this net ns has owner user ns.

In my understanding, the capability check is for the process's capability for a specific resource _type_.
And have nothing to do with the individual resource itself.

A sane flow in my view is,
a. create user namespace
b. create net namespace
c. move rdma device to net namespace done in #b
d. launch a process in user_ns of #a and net ns of #b and let it operate the device from there.

If the process after step #d, creates new net ns, or new user ns, any new ioctl() for resource creation, will check caps against the latest net/user ns.

> 
> > This is because rdma networking resources are governed by the net
> namespace.
> >
> > Above #5 aligns with the example from existing kernel doc snippet below [1]
> and few kernel examples of [2].
> >
> > For example, suppose that a process attempts to change
> >        the hostname (sethostname(2)), a resource governed by the UTS
> >        namespace.  In this case, the kernel will determine which user
> >        namespace owns the process's UTS namespace, and check whether the
> >        process has the required capability (CAP_SYS_ADMIN) in that user
> >        namespace.
> >
> > [1] https://man7.org/linux/man-pages/man7/user_namespaces.7.html
> >
> > [2] examples snippet that follows above guidance of #5.
> >
> > File: drivers/infiniband/core/device.c
> > Function: ib_device_set_netns_put()
> > For net namespace:
> >
> >          if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
> >                  ret = -EPERM;
> >                  goto ns_err;
> >          }
> >
> > File: fs/namespace.c
> > For mount namespace:
> >         if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
> >                 goto out;
> >         if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
> >                 goto out;
> >
> > For uts ns:
> >  static int utsns_install(struct nsset *nsset, struct ns_common *new)
> > {
> >          struct nsproxy *nsproxy = nsset->nsproxy;
> >          struct uts_namespace *ns = to_uts_ns(new);
> >
> >          if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> >              !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
> >                  return -EPERM;
> >
> > For net ns:
> > File: net/core/dev_ioctl.c
> >          case SIOCSHWTSTAMP:
> >                  if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
> >                          return -EPERM;
> >                  fallthrough;
> >
> > static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user
> > *user, int *len) {
> >          int ret;
> >
> >          if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
> >                  return -EPERM;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-21 11:04               ` Parav Pandit
@ 2025-04-21 13:00                 ` Serge E. Hallyn
  2025-04-21 13:33                   ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-21 13:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Jason Gunthorpe, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Mon, Apr 21, 2025 at 11:04:57AM +0000, Parav Pandit wrote:
> 
> > From: Serge E. Hallyn <serge@hallyn.com>
> > Sent: Monday, April 21, 2025 8:43 AM
> > 
> > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > Hi Eric, Jason,
> > 
> > Hi,
> > 
> > I'm jumping back up the thread as I think this email best details the things I'm
> > confused about :)  Three questions below in two different stanzas.
> > 
> > > To summarize,
> > >
> > > 1. A process can open an RDMA resource (such as a raw QP, raw flow
> > > entry, or similar 'raw' resource) through the fd using ioctl(), if it has the
> > appropriate capability, which in this case is CAP_NET_RAW.
> > 
> > Why does it need CAP_NET_RAW to create the resource, if the resource won't
> > be usable by a process without CAP_NET_RAW later anyway?  
> Once the resource is created, and the fd is shared (like a raw socket fd), it will be usable by a process without CAP_NET_RAW.
> Is that a concern? If yes, how is it solved for raw socket fd? It appears to me that it is not.
> 
> > Is that legacy
> > for the read/write (vs ioctl) case?  
> No.
> 
> > Or is it to limit the number of opened
> > resources?  Or some other reason?
> > 
> The resource enables to do raw operation, hence the capability check of the process for having NET_RAW cap.

Ok, so it seems to me that

1. the create should check ns_capable(current->nsproxy->net->user_ns, CAP_NET_RAW)
2. the read/write are a known escape, eventually to be removed?
3. the ioctl should check file_ns_capable(attrs->ufile->filp->f_cred->user_ns, CAP_NET_RAW)

Two notes about (3).  First, note that it's different from what you had.
It explicitly checks that the caller has CAP_NET_RAW against the net
namespace that was used to open the file.  Second, I'm suggesting this
because Jason does keep saying that ioctl is supposed to solve the
missing permission check.  If it really is felt that no permission
check should be needed, that's a different discussion.  I've just been
trying to figure out where the state should be tracked.

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-21 13:00                 ` Serge E. Hallyn
@ 2025-04-21 13:33                   ` Parav Pandit
  2025-04-21 17:22                     ` Serge E. Hallyn
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-21 13:33 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Jason Gunthorpe, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky



> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Monday, April 21, 2025 6:30 PM
> 
> On Mon, Apr 21, 2025 at 11:04:57AM +0000, Parav Pandit wrote:
> >
> > > From: Serge E. Hallyn <serge@hallyn.com>
> > > Sent: Monday, April 21, 2025 8:43 AM
> > >
> > > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > > Hi Eric, Jason,
> > >
> > > Hi,
> > >
> > > I'm jumping back up the thread as I think this email best details
> > > the things I'm confused about :)  Three questions below in two different
> stanzas.
> > >
> > > > To summarize,
> > > >
> > > > 1. A process can open an RDMA resource (such as a raw QP, raw flow
> > > > entry, or similar 'raw' resource) through the fd using ioctl(), if
> > > > it has the
> > > appropriate capability, which in this case is CAP_NET_RAW.
> > >
> > > Why does it need CAP_NET_RAW to create the resource, if the resource
> > > won't be usable by a process without CAP_NET_RAW later anyway?
> > Once the resource is created, and the fd is shared (like a raw socket fd), it
> will be usable by a process without CAP_NET_RAW.
> > Is that a concern? If yes, how is it solved for raw socket fd? It appears to me
> that it is not.
> >
> > > Is that legacy
> > > for the read/write (vs ioctl) case?
> > No.
> >
> > > Or is it to limit the number of opened resources?  Or some other
> > > reason?
> > >
> > The resource enables to do raw operation, hence the capability check of the
> process for having NET_RAW cap.
> 
> Ok, so it seems to me that
> 
> 1. the create should check ns_capable(current->nsproxy->net->user_ns,
> CAP_NET_RAW) 
I believe this is sufficient as this create call happens through the ioctl().
But more question on #3.

> 2. the read/write are a known escape, eventually to be
> removed?
Write should be deprecated eventually.
Jason mentioned that write() can be compiled out of kernel.
I guess it needs new compile time config flag around [1].

[1] https://elixir.bootlin.com/linux/v6.14.3/source/drivers/infiniband/core/uverbs_main.c#L1037

> 3. the ioctl should check file_ns_capable(attrs->ufile->filp->f_cred->user_ns,
> CAP_NET_RAW)
> 
> Two notes about (3).  First, note that it's different from what you had.
> It explicitly checks that the caller has CAP_NET_RAW against the net
> namespace that was used to open the file.  
How is the net namespace linked in #3?
Is it because when file was opened, the rdma device was accessible in a given net ns?
But again the net ns explicitly not accessed in #3.

> Second, I'm suggesting this
> because Jason does keep saying that ioctl is supposed to solve the missing
> permission check.  
I don't understand how ioctl() is replacement to capability ns_capable() check.
Do you mean to delete the capable() call itself?
I likely misunderstood..

> If it really is felt that no permission check should be
> needed, that's a different discussion.  I've just been trying to figure out where
> the state should be tracked.
> 
> -serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-21 13:33                   ` Parav Pandit
@ 2025-04-21 17:22                     ` Serge E. Hallyn
  2025-04-22 12:46                       ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-21 17:22 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Jason Gunthorpe, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Mon, Apr 21, 2025 at 01:33:45PM +0000, Parav Pandit wrote:
> 
> 
> > From: Serge E. Hallyn <serge@hallyn.com>
> > Sent: Monday, April 21, 2025 6:30 PM
> > 
> > On Mon, Apr 21, 2025 at 11:04:57AM +0000, Parav Pandit wrote:
> > >
> > > > From: Serge E. Hallyn <serge@hallyn.com>
> > > > Sent: Monday, April 21, 2025 8:43 AM
> > > >
> > > > On Fri, Apr 04, 2025 at 02:53:30PM +0000, Parav Pandit wrote:
> > > > > Hi Eric, Jason,
> > > >
> > > > Hi,
> > > >
> > > > I'm jumping back up the thread as I think this email best details
> > > > the things I'm confused about :)  Three questions below in two different
> > stanzas.
> > > >
> > > > > To summarize,
> > > > >
> > > > > 1. A process can open an RDMA resource (such as a raw QP, raw flow
> > > > > entry, or similar 'raw' resource) through the fd using ioctl(), if
> > > > > it has the
> > > > appropriate capability, which in this case is CAP_NET_RAW.
> > > >
> > > > Why does it need CAP_NET_RAW to create the resource, if the resource
> > > > won't be usable by a process without CAP_NET_RAW later anyway?
> > > Once the resource is created, and the fd is shared (like a raw socket fd), it
> > will be usable by a process without CAP_NET_RAW.
> > > Is that a concern? If yes, how is it solved for raw socket fd? It appears to me
> > that it is not.
> > >
> > > > Is that legacy
> > > > for the read/write (vs ioctl) case?
> > > No.
> > >
> > > > Or is it to limit the number of opened resources?  Or some other
> > > > reason?
> > > >
> > > The resource enables to do raw operation, hence the capability check of the
> > process for having NET_RAW cap.
> > 
> > Ok, so it seems to me that
> > 
> > 1. the create should check ns_capable(current->nsproxy->net->user_ns,
> > CAP_NET_RAW) 
> I believe this is sufficient as this create call happens through the ioctl().
> But more question on #3.
> 
> > 2. the read/write are a known escape, eventually to be
> > removed?
> Write should be deprecated eventually.
> Jason mentioned that write() can be compiled out of kernel.
> I guess it needs new compile time config flag around [1].
> 
> [1] https://elixir.bootlin.com/linux/v6.14.3/source/drivers/infiniband/core/uverbs_main.c#L1037
> 
> > 3. the ioctl should check file_ns_capable(attrs->ufile->filp->f_cred->user_ns,
> > CAP_NET_RAW)
> > 
> > Two notes about (3).  First, note that it's different from what you had.
> > It explicitly checks that the caller has CAP_NET_RAW against the net
> > namespace that was used to open the file.  
> How is the net namespace linked in #3?
> Is it because when file was opened, the rdma device was accessible in a given net ns?
> But again the net ns explicitly not accessed in #3.

I'll have to look around and see if we can deduce the netns from elsewhere,
the device perhaps.  But IIUC the file's user_ns should be the one for
which we checked that it has CAP_NET_RAW over the actual net->user_ns,
so if you have CAP_NET_RAW in that user_ns, then you're good.  Where it
*could* get wonky is if the opener was in a parent userns of the net->userns.
In that case the file's userns will be sufficient to access the net, but
we could end up denying access from a privileged process in its child
user_ns, that is, potentially, the net->userns.

> > Second, I'm suggesting this
> > because Jason does keep saying that ioctl is supposed to solve the missing
> > permission check.  
> I don't understand how ioctl() is replacement to capability ns_capable() check.

I'm assuming the ioctl system call handler does the check.  I'll double-check.

> Do you mean to delete the capable() call itself?
> I likely misunderstood..
> 
> > If it really is felt that no permission check should be
> > needed, that's a different discussion.  I've just been trying to figure out where
> > the state should be tracked.
> > 
> > -serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-21 17:22                     ` Serge E. Hallyn
@ 2025-04-22 12:46                       ` Jason Gunthorpe
  2025-04-22 13:14                         ` Serge E. Hallyn
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-22 12:46 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Parav Pandit, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> > > 1. the create should check ns_capable(current->nsproxy->net->user_ns,
> > > CAP_NET_RAW) 
> > I believe this is sufficient as this create call happens through the ioctl().
> > But more question on #3.

I think this is the right one to use everywhere.

> > > 3. the ioctl should check file_ns_capable(attrs->ufile->filp->f_cred->user_ns,
> > > CAP_NET_RAW)
> > > 
> > > Two notes about (3).  First, note that it's different from what you had.
> > > It explicitly checks that the caller has CAP_NET_RAW against the net
> > > namespace that was used to open the file.  
> > How is the net namespace linked in #3?
> > Is it because when file was opened, the rdma device was accessible in a given net ns?
> > But again the net ns explicitly not accessed in #3.
> 
> I'll have to look around and see if we can deduce the netns from elsewhere,
> the device perhaps.  But IIUC the file's user_ns should be the one for
> which we checked that it has CAP_NET_RAW over the actual net->user_ns,
> so if you have CAP_NET_RAW in that user_ns, then you're good.  Where it
> *could* get wonky is if the opener was in a parent userns of the net->userns.
> In that case the file's userns will be sufficient to access the net, but
> we could end up denying access from a privileged process in its child
> user_ns, that is, potentially, the net->userns.

We should never be taking any security check from the struct file.

All security checks are only done on current in rdma, the context of
the file opener must contribute nothing. The file opener could have
had more privilege than the child process somehow, and that extra
privilege should not leak into the child.

Even in goofy cases like passing a FD between processes with different
net namespaces, the expectation is that objects can be created
relative to net namespace of the process calling the ioctl, and then
accessed by the other process in the other namespace.

Objects should and do become bound to the net namespaces that created
them, just not the FD.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-22 12:46                       ` Jason Gunthorpe
@ 2025-04-22 13:14                         ` Serge E. Hallyn
  2025-04-22 16:11                           ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-22 13:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

Hi Jason,

On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> > > > 1. the create should check ns_capable(current->nsproxy->net->user_ns,
> > > > CAP_NET_RAW) 
> > > I believe this is sufficient as this create call happens through the ioctl().
> > > But more question on #3.
> 
> I think this is the right one to use everywhere.

It's the right one to use when creating resources, but when later using
them, since below you say that the resource should in fact be tied to
the creator's network namespace, that means that checking
current->nsproxy->net->user_ns would have nothing to do with the
resource being used, right?

> > > > 3. the ioctl should check file_ns_capable(attrs->ufile->filp->f_cred->user_ns,
> > > > CAP_NET_RAW)
> > > > 
> > > > Two notes about (3).  First, note that it's different from what you had.
> > > > It explicitly checks that the caller has CAP_NET_RAW against the net
> > > > namespace that was used to open the file.  
> > > How is the net namespace linked in #3?
> > > Is it because when file was opened, the rdma device was accessible in a given net ns?
> > > But again the net ns explicitly not accessed in #3.
> > 
> > I'll have to look around and see if we can deduce the netns from elsewhere,
> > the device perhaps.  But IIUC the file's user_ns should be the one for
> > which we checked that it has CAP_NET_RAW over the actual net->user_ns,
> > so if you have CAP_NET_RAW in that user_ns, then you're good.  Where it
> > *could* get wonky is if the opener was in a parent userns of the net->userns.
> > In that case the file's userns will be sufficient to access the net, but
> > we could end up denying access from a privileged process in its child
> > user_ns, that is, potentially, the net->userns.
> 
> We should never be taking any security check from the struct file.
> 
> All security checks are only done on current in rdma, the context of
> the file opener must contribute nothing. The file opener could have
> had more privilege than the child process somehow, and that extra
> privilege should not leak into the child.
> 
> Even in goofy cases like passing a FD between processes with different
> net namespaces, the expectation is that objects can be created
> relative to net namespace of the process calling the ioctl, and then
> accessed by the other process in the other namespace.

So when earlier it was said that uverbs was switching from read/write
to ioctl so that permissions could be checked, that is not actually
the case?  The intent is for a privileged task to create the
resource and be able to pass it to any task in any namespace with any
or no privilege and have that task be able to use it with the
opener's original privilege, just as with read/write?

> Objects should and do become bound to the net namespaces that created
> them, just not the FD.

I was trying last night to track down where the uverb ioctls are doing 
permission checks, but failing to find it.  I see where the
pbundle->method_elm->handler gets dereferenced, but not where those
are defined.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-22 13:14                         ` Serge E. Hallyn
@ 2025-04-22 16:11                           ` Jason Gunthorpe
  2025-04-22 16:29                             ` Serge E. Hallyn
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-22 16:11 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Parav Pandit, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Tue, Apr 22, 2025 at 08:14:33AM -0500, Serge E. Hallyn wrote:
> Hi Jason,
> 
> On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> > > > > 1. the create should check ns_capable(current->nsproxy->net->user_ns,
> > > > > CAP_NET_RAW) 
> > > > I believe this is sufficient as this create call happens through the ioctl().
> > > > But more question on #3.
> > 
> > I think this is the right one to use everywhere.
> 
> It's the right one to use when creating resources, but when later using
> them, since below you say that the resource should in fact be tied to
> the creator's network namespace, that means that checking
> current->nsproxy->net->user_ns would have nothing to do with the
> resource being used, right?

Yes, in that case you'd check something stored in the uobject.

This happens sort of indirectly, for instance an object may become
associated with a netdevice and the netdevice is linked to a net
namespace. Eg we should do route lookups relative to that associated
net devices's namespaces.

I'm not sure we have a capable like check like that though.

> > Even in goofy cases like passing a FD between processes with different
> > net namespaces, the expectation is that objects can be created
> > relative to net namespace of the process calling the ioctl, and then
> > accessed by the other process in the other namespace.
> 
> So when earlier it was said that uverbs was switching from read/write
> to ioctl so that permissions could be checked, that is not actually
> the case? 

I don't quite know what you mean here?

read/write has a security problem in that you can pass a FD to a
setuid program as its stdout and have that setuid program issue a
write() to trigger a kernel operation using it's elevated
privilege. This is not possible with ioctl.

When this bug was discovered the read/write path started calling
ib_safe_file_access() which blanket disallows *any* credential change
from open() to write().

ioctl removes this excessive restriction and we are back to
per-process checks.

> The intent is for a privileged task to create the
> resource and be able to pass it to any task in any namespace with any
> or no privilege and have that task be able to use it with the
> opener's original privilege, just as with read/write?

Yes. The permissions affiliate with the object contained inside the
FD, not the FD itself. The FD is just a container and a way to route
system calls.

> I was trying last night to track down where the uverb ioctls are doing 
> permission checks, but failing to find it.  I see where the
> pbundle->method_elm->handler gets dereferenced, but not where those
> are defined.

There are very few permission checks. Most boil down to implicit
things, like we have a netdevice relative to current's net namespace
and we need to find a gid table index for that netdevice. We don't
actually need to do anything special here as the ifindex code
automatically validates the namespaces and struct net_device * are
globally unique.

Similarly with route lookups and things, once we validated the net
device objects are supposed to remain bound to it.

The cases like cap_net_raw are one time checks at creation time that
modify the devices' rules for processing the queues. The devices check
the creation property of the queue when processing the queue.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-22 16:11                           ` Jason Gunthorpe
@ 2025-04-22 16:29                             ` Serge E. Hallyn
  2025-04-23 12:41                               ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-22 16:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Tue, Apr 22, 2025 at 01:11:27PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 08:14:33AM -0500, Serge E. Hallyn wrote:
> > Hi Jason,
> > 
> > On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> > > > > > 1. the create should check ns_capable(current->nsproxy->net->user_ns,
> > > > > > CAP_NET_RAW) 
> > > > > I believe this is sufficient as this create call happens through the ioctl().
> > > > > But more question on #3.
> > > 
> > > I think this is the right one to use everywhere.
> > 
> > It's the right one to use when creating resources, but when later using
> > them, since below you say that the resource should in fact be tied to
> > the creator's network namespace, that means that checking
> > current->nsproxy->net->user_ns would have nothing to do with the
> > resource being used, right?
> 
> Yes, in that case you'd check something stored in the uobject.

Perfect, that's exactly the kind of thing I was looking for.  Thanks.

> This happens sort of indirectly, for instance an object may become
> associated with a netdevice and the netdevice is linked to a net
> namespace. Eg we should do route lookups relative to that associated
> net devices's namespaces.
> 
> I'm not sure we have a capable like check like that though.
> 
> > > Even in goofy cases like passing a FD between processes with different
> > > net namespaces, the expectation is that objects can be created
> > > relative to net namespace of the process calling the ioctl, and then
> > > accessed by the other process in the other namespace.
> > 
> > So when earlier it was said that uverbs was switching from read/write
> > to ioctl so that permissions could be checked, that is not actually
> > the case? 
> 
> I don't quite know what you mean here?
> 
> read/write has a security problem in that you can pass a FD to a
> setuid program as its stdout and have that setuid program issue a
> write() to trigger a kernel operation using it's elevated
> privilege. This is not possible with ioctl.
> 
> When this bug was discovered the read/write path started calling
> ib_safe_file_access() which blanket disallows *any* credential change
> from open() to write().
> 
> ioctl removes this excessive restriction and we are back to
> per-process checks.
> 
> > The intent is for a privileged task to create the
> > resource and be able to pass it to any task in any namespace with any
> > or no privilege and have that task be able to use it with the
> > opener's original privilege, just as with read/write?
> 
> Yes. The permissions affiliate with the object contained inside the
> FD, not the FD itself. The FD is just a container and a way to route
> system calls.
> 
> > I was trying last night to track down where the uverb ioctls are doing 
> > permission checks, but failing to find it.  I see where the
> > pbundle->method_elm->handler gets dereferenced, but not where those
> > are defined.
> 
> There are very few permission checks. Most boil down to implicit
> things, like we have a netdevice relative to current's net namespace
> and we need to find a gid table index for that netdevice. We don't
> actually need to do anything special here as the ifindex code
> automatically validates the namespaces and struct net_device * are
> globally unique.
> 
> Similarly with route lookups and things, once we validated the net
> device objects are supposed to remain bound to it.
> 
> The cases like cap_net_raw are one time checks at creation time that
> modify the devices' rules for processing the queues. The devices check
> the creation property of the queue when processing the queue.

Thank you for the detailed explanation.

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-22 16:29                             ` Serge E. Hallyn
@ 2025-04-23 12:41                               ` Parav Pandit
  2025-04-23 14:46                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-23 12:41 UTC (permalink / raw)
  To: Serge E. Hallyn, Jason Gunthorpe
  Cc: Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Tuesday, April 22, 2025 10:00 PM
> 
> On Tue, Apr 22, 2025 at 01:11:27PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 22, 2025 at 08:14:33AM -0500, Serge E. Hallyn wrote:
> > > Hi Jason,
> > >
> > > On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> > > > > > > 1. the create should check
> > > > > > > ns_capable(current->nsproxy->net->user_ns,
> > > > > > > CAP_NET_RAW)
> > > > > > I believe this is sufficient as this create call happens through the
> ioctl().
> > > > > > But more question on #3.
> > > >
> > > > I think this is the right one to use everywhere.
> > >
> > > It's the right one to use when creating resources, but when later
> > > using them, since below you say that the resource should in fact be
> > > tied to the creator's network namespace, that means that checking
> > > current->nsproxy->net->user_ns would have nothing to do with the
> > > resource being used, right?
> >
> > Yes, in that case you'd check something stored in the uobject.
> 
> Perfect, that's exactly the kind of thing I was looking for.  Thanks.
>
It means uboject create path will refcount and store user_ns, 

uobject->user_ns = get_user_ns(current->nsproxy->net->user_ns);

And uobject destroy will do,
	put_user_ns(uobject->user_ns).

This will ensure that in below flow we won't have use_after_free.
1. process_A created object in user_ns_A
2. process_A shared fd with process_B in user_ns_B
3. process_A is killed and
4. user_ns_A is free is attempted (free is skipped, until uobject is destroyed by process_B).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-23 12:41                               ` Parav Pandit
@ 2025-04-23 14:46                                 ` Jason Gunthorpe
  2025-04-23 15:43                                   ` Eric W. Biederman
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 14:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Wed, Apr 23, 2025 at 12:41:26PM +0000, Parav Pandit wrote:
> 
> > From: Serge E. Hallyn <serge@hallyn.com>
> > Sent: Tuesday, April 22, 2025 10:00 PM
> > 
> > On Tue, Apr 22, 2025 at 01:11:27PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 22, 2025 at 08:14:33AM -0500, Serge E. Hallyn wrote:
> > > > Hi Jason,
> > > >
> > > > On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> > > > > > > > 1. the create should check
> > > > > > > > ns_capable(current->nsproxy->net->user_ns,
> > > > > > > > CAP_NET_RAW)
> > > > > > > I believe this is sufficient as this create call happens through the
> > ioctl().
> > > > > > > But more question on #3.
> > > > >
> > > > > I think this is the right one to use everywhere.
> > > >
> > > > It's the right one to use when creating resources, but when later
> > > > using them, since below you say that the resource should in fact be
> > > > tied to the creator's network namespace, that means that checking
> > > > current->nsproxy->net->user_ns would have nothing to do with the
> > > > resource being used, right?
> > >
> > > Yes, in that case you'd check something stored in the uobject.
> > 
> > Perfect, that's exactly the kind of thing I was looking for.  Thanks.
> >
> It means uboject create path will refcount and store user_ns, 
> 
> uobject->user_ns = get_user_ns(current->nsproxy->net->user_ns);
> 
> And uobject destroy will do,
> 	put_user_ns(uobject->user_ns).
> 
> This will ensure that in below flow we won't have use_after_free.
> 1. process_A created object in user_ns_A
> 2. process_A shared fd with process_B in user_ns_B
> 3. process_A is killed and
> 4. user_ns_A is free is attempted (free is skipped, until uobject is destroyed by process_B).

We only need to do that if something is legimitately doing capable
from a uobject outside of creation? Did you find that?

And I wonder if using the uobjects affiliated netdev's namespace is
OK?

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-23 14:46                                 ` Jason Gunthorpe
@ 2025-04-23 15:43                                   ` Eric W. Biederman
  2025-04-23 15:56                                     ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-23 15:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Wed, Apr 23, 2025 at 12:41:26PM +0000, Parav Pandit wrote:
>> 
>> > From: Serge E. Hallyn <serge@hallyn.com>
>> > Sent: Tuesday, April 22, 2025 10:00 PM
>> > 
>> > On Tue, Apr 22, 2025 at 01:11:27PM -0300, Jason Gunthorpe wrote:
>> > > On Tue, Apr 22, 2025 at 08:14:33AM -0500, Serge E. Hallyn wrote:
>> > > > Hi Jason,
>> > > >
>> > > > On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
>> > > > > On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
>> > > > > > > > 1. the create should check
>> > > > > > > > ns_capable(current->nsproxy->net->user_ns,
>> > > > > > > > CAP_NET_RAW)
>> > > > > > > I believe this is sufficient as this create call happens through the
>> > ioctl().
>> > > > > > > But more question on #3.
>> > > > >
>> > > > > I think this is the right one to use everywhere.
>> > > >
>> > > > It's the right one to use when creating resources, but when later
>> > > > using them, since below you say that the resource should in fact be
>> > > > tied to the creator's network namespace, that means that checking
>> > > > current->nsproxy->net->user_ns would have nothing to do with the
>> > > > resource being used, right?
>> > >
>> > > Yes, in that case you'd check something stored in the uobject.
>> > 
>> > Perfect, that's exactly the kind of thing I was looking for.  Thanks.
>> >
>> It means uboject create path will refcount and store user_ns, 
>> 
>> uobject->user_ns = get_user_ns(current->nsproxy->net->user_ns);
>> 
>> And uobject destroy will do,
>> 	put_user_ns(uobject->user_ns).
>> 
>> This will ensure that in below flow we won't have use_after_free.
>> 1. process_A created object in user_ns_A
>> 2. process_A shared fd with process_B in user_ns_B
>> 3. process_A is killed and
>> 4. user_ns_A is free is attempted (free is skipped, until uobject is destroyed by process_B).
>
> We only need to do that if something is legimitately doing capable
> from a uobject outside of creation? Did you find that?

I believe the proposed change that started this discussion,
was to make rdma usable inside of a user namespace.

Which led to the question: Are the current capable calls safe and
correct, as they aren't preserving the context that can with opening a
file descriptor?  If I have skimmed this thread correctly the answer
not preserving the opener's context is a seriously atypical but
deliberate choice.

> And I wonder if using the uobjects affiliated netdev's namespace is
> OK?

That is actually preferable.  It is what I updated the rest of the
network stack to do.  I don't know if you would use dev_net or
something else.

Going back to the original proposal I don't know how ready the code is
to handle callers that are not root.  This is both a question of
semantics (is it safe in theory) and a question of implementation (are
there unfixed bugs that no one cares about because only root has been
using the code).

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-23 15:43                                   ` Eric W. Biederman
@ 2025-04-23 15:56                                     ` Parav Pandit
  2025-04-23 16:45                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-23 15:56 UTC (permalink / raw)
  To: Eric W. Biederman, Jason Gunthorpe
  Cc: Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

> From: Eric W. Biederman <ebiederm@xmission.com>
> Sent: Wednesday, April 23, 2025 9:14 PM
> 
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > On Wed, Apr 23, 2025 at 12:41:26PM +0000, Parav Pandit wrote:
> >>
> >> > From: Serge E. Hallyn <serge@hallyn.com>
> >> > Sent: Tuesday, April 22, 2025 10:00 PM
> >> >
> >> > On Tue, Apr 22, 2025 at 01:11:27PM -0300, Jason Gunthorpe wrote:
> >> > > On Tue, Apr 22, 2025 at 08:14:33AM -0500, Serge E. Hallyn wrote:
> >> > > > Hi Jason,
> >> > > >
> >> > > > On Tue, Apr 22, 2025 at 09:46:40AM -0300, Jason Gunthorpe wrote:
> >> > > > > On Mon, Apr 21, 2025 at 12:22:36PM -0500, Serge E. Hallyn wrote:
> >> > > > > > > > 1. the create should check
> >> > > > > > > > ns_capable(current->nsproxy->net->user_ns,
> >> > > > > > > > CAP_NET_RAW)
> >> > > > > > > I believe this is sufficient as this create call happens
> >> > > > > > > through the
> >> > ioctl().
> >> > > > > > > But more question on #3.
> >> > > > >
> >> > > > > I think this is the right one to use everywhere.
> >> > > >
> >> > > > It's the right one to use when creating resources, but when
> >> > > > later using them, since below you say that the resource should
> >> > > > in fact be tied to the creator's network namespace, that means
> >> > > > that checking
> >> > > > current->nsproxy->net->user_ns would have nothing to do with
> >> > > > current->nsproxy->net->the
> >> > > > resource being used, right?
> >> > >
> >> > > Yes, in that case you'd check something stored in the uobject.
> >> >
> >> > Perfect, that's exactly the kind of thing I was looking for.  Thanks.
> >> >
> >> It means uboject create path will refcount and store user_ns,
> >>
> >> uobject->user_ns = get_user_ns(current->nsproxy->net->user_ns);
> >>
> >> And uobject destroy will do,
> >> 	put_user_ns(uobject->user_ns).
> >>
> >> This will ensure that in below flow we won't have use_after_free.
> >> 1. process_A created object in user_ns_A 2. process_A shared fd with
> >> process_B in user_ns_B 3. process_A is killed and 4. user_ns_A is
> >> free is attempted (free is skipped, until uobject is destroyed by process_B).
> >
> > We only need to do that if something is legimitately doing capable
> > from a uobject outside of creation? Did you find that?
> 
> I believe the proposed change that started this discussion, was to make rdma
> usable inside of a user namespace.
> 
Answering both you and Jason above question.
It will be only specific uobjects which demands the CAP_X will have user_ns reference.
Rest continue as_is.

> Which led to the question: Are the current capable calls safe and correct, as
> they aren't preserving the context that can with opening a file descriptor?  If I
> have skimmed this thread correctly the answer not preserving the opener's
> context is a seriously atypical but deliberate choice.
> 
> > And I wonder if using the uobjects affiliated netdev's namespace is
> > OK?
>
We don't refer to the netdev of the rdma. Because netdev is not there in many cases.
Its just rdma device.

We always refer to the net ns of the currently running process.
And a process is able to access the rdma because the process and rdma device are in same net ns.


> That is actually preferable.  It is what I updated the rest of the network stack
> to do.  I don't know if you would use dev_net or something else.
>
Current->nxproxy->net->user_ns should be sufficient to refer as both the device and process are in same net ns.
Moreover it's the capability of the process being accessed, not really the rdma device.
 
> Going back to the original proposal I don't know how ready the code is to
> handle callers that are not root.  This is both a question of semantics (is it safe
> in theory) and a question of implementation (are there unfixed bugs that no
> one cares about because only root has been using the code).
> 
> Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-23 15:56                                     ` Parav Pandit
@ 2025-04-23 16:45                                       ` Jason Gunthorpe
  2025-04-24  9:08                                         ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 16:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Eric W. Biederman, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Wed, Apr 23, 2025 at 03:56:39PM +0000, Parav Pandit wrote:
> > > And I wonder if using the uobjects affiliated netdev's namespace is
> > > OK?
> >
> We don't refer to the netdev of the rdma. Because netdev is not there in many cases.
> Its just rdma device.

The ib_device itself also has a net namespace these days.

I really worry that a single uobject has too many choices for the namespace:

 1) The one provided by current during a system call
 2) The one that was active in current when the uobject was created
 3) The one that is linked to a netdev associated with the uobject when it was created
 4) The one that is linked to the ufile's underlying ib_device
 5) The one that was active in current when the ufile was opened.

In all practical cases we expect that all of the above are the same
thing, so this is looking at fringe cases where userspace is changing
the namespaces during the lifecycle of the FD.

So.. Some basic questions.

Since ib_device has a namespace and ufile is tied to a ib_device, can
we ever have a situation where the ib_device has a different namespace
than the ufile's? This would mean we changed the namespace of the
ib_device, and IIRC, that means we revoked/disassociated the ufile? So
the answer is no? This means #4 and #5 are the same thing.

Can a uobject affiliated netdev have a different namespace than the
ib_device? The netdevs arise from the gid table, and the gid table
population should strictly follow the ib_device namespace, yes? So, I
think the answer is generally no, but there are going to be transient
cases where a gid table entry is in progress to delete while a netdev
is moving to another namespace? This means #3/#4/#5 are the same
thing.

Can current have a different namespace than the ib_device? I guess
yes, the FD can be passed around. However this would mean that the FD
caller should not be able to get any gid table handles as none of its
ifindexes will work. So #1 is != #3/#4/#5

And finally the FD can be passed around after the uobject is created
so #2 != #1.

So, I would say the correct namespace path to use depends entirely on
what it is you are checking.

1) During uobject creation CAP_NET_RAW is checked against current.
   Perhaps we should further insist that current == ib_device's NS
   as well?
2) During gid_table lookup for any reason. Use current to translate
   the ifindex to a netdevice. Match the netdevice against the gid
   table. Effectively fails if current != ib_device's NS.
3) Routing lookups/etc should use the namespace of the netdevice of
   the gid index being looked up.

What other NS users are there?

> > Going back to the original proposal I don't know how ready the code is to
> > handle callers that are not root.  This is both a question of semantics (is it safe
> > in theory) and a question of implementation (are there unfixed bugs that no
> > one cares about because only root has been using the code).

We need to look at each change, but I think most of it is fine.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-23 16:45                                       ` Jason Gunthorpe
@ 2025-04-24  9:08                                         ` Parav Pandit
  2025-04-24 14:13                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-24  9:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric W. Biederman, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 23, 2025 10:16 PM
> 
> On Wed, Apr 23, 2025 at 03:56:39PM +0000, Parav Pandit wrote:
> > > > And I wonder if using the uobjects affiliated netdev's namespace
> > > > is OK?
> > >
> > We don't refer to the netdev of the rdma. Because netdev is not there in
> many cases.
> > Its just rdma device.
> 
> The ib_device itself also has a net namespace these days.
> 
> I really worry that a single uobject has too many choices for the namespace:
> 
>  1) The one provided by current during a system call
>  2) The one that was active in current when the uobject was created
>  3) The one that is linked to a netdev associated with the uobject when it was
> created
>  4) The one that is linked to the ufile's underlying ib_device
>  5) The one that was active in current when the ufile was opened.
> 
> In all practical cases we expect that all of the above are the same thing, so this
> is looking at fringe cases where userspace is changing the namespaces during
> the lifecycle of the FD.
> 
> So.. Some basic questions.
> 
> Since ib_device has a namespace and ufile is tied to a ib_device, can we ever
> have a situation where the ib_device has a different namespace than the
> ufile's? This would mean we changed the namespace of the ib_device, and
> IIRC, that means we revoked/disassociated the ufile? So the answer is no?
> This means #4 and #5 are the same thing.
>
Right.
 
> Can a uobject affiliated netdev have a different namespace than the
> ib_device? 
When a uobject when created, it is not affiliated to netdev.
In many cases, a rdma device does not even have netdev at all.
Some examples of it are, IB device without ipoib, efa, an SF rdma device.
So uboject does not need to worry about netdev at all.
At best it is the net ns of #1 or #2.

> The netdevs arise from the gid table, and the gid table population
> should strictly follow the ib_device namespace, yes? 
I wish it this way, but unfortunately, rdma still have ancient shared mode for example single rdma device + macvlan.
Until that is deprecated, let the gid table entry's netdev drives the QP modify as done today.

> So, I think the answer is
> generally no, but there are going to be transient cases where a gid table entry
> is in progress to delete while a netdev is moving to another namespace? This
> means #3/#4/#5 are the same thing.
>
True. 
 
> Can current have a different namespace than the ib_device? I guess yes, the
> FD can be passed around. However this would mean that the FD caller should
> not be able to get any gid table handles as none of its ifindexes will work. So
> #1 is != #3/#4/#5
> 
Well, it can pass the fd after the ifindex is resolved (i.e. after modify_qp).
If fd is passed before modify qp in different net ns, its can get access too because rdma device got shared.
But that is the case with raw socket too.
The difference is, every send() call checks the ifindex, vs here its checked when raw qp is created.

We can add the additional check in the sysfs and in modify qp, but very long ago (2019), we envisioned that users should use only the exclusive mode.
And hence, those checks were not added.

> And finally the FD can be passed around after the uobject is created so #2 !=
> #1.
>
Right. So the optimal place to attach to user_ns seems #2.
 
> So, I would say the correct namespace path to use depends entirely on what it
> is you are checking.
>
When the uobject is created, that's where the enforcement should happen against the current process.
 
> 1) During uobject creation CAP_NET_RAW is checked against current.
>    Perhaps we should further insist that current == ib_device's NS
>    as well?
This can be done when we are in exclusive mode which is not the default to avoid breaking backward compatibility.

> 2) During gid_table lookup for any reason. Use current to translate
>    the ifindex to a netdevice. Match the netdevice against the gid
>    table.
We always use the ifindex and net_ns of the netdev of the GID. So this is ok.

> Effectively fails if current != ib_device's NS.
Only possible when in exclusive mode.

> 3) Routing lookups/etc should use the namespace of the netdevice of
>    the gid index being looked up.
> 
This is already done.

> What other NS users are there?
Incoming rx IB mad packets are looked up in the GID's attached netdev's net ns.
In-kernel ulps (nfs, smc) do not seem to have the interest, but they do not created uobjects nor they access any uverbs fd.
So they will be protected differently anyway.
I had patches in 2019 time where kernel caller passes the net_ns as they hold the reference to it.
But no interest in users of it, so skipped it.
Last time someone from Western Digital reached out for nvme fabrics but no_show after that.

Regardless, those users are unrelated to user_ns uobjects.
> 
> > > Going back to the original proposal I don't know how ready the code
> > > is to handle callers that are not root.  This is both a question of
> > > semantics (is it safe in theory) and a question of implementation
> > > (are there unfixed bugs that no one cares about because only root has
> been using the code).
> 
> We need to look at each change, but I think most of it is fine.
> 
> Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-24  9:08                                         ` Parav Pandit
@ 2025-04-24 14:13                                           ` Jason Gunthorpe
  2025-04-25 13:14                                             ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-24 14:13 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Eric W. Biederman, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Thu, Apr 24, 2025 at 09:08:17AM +0000, Parav Pandit wrote:
> > Since ib_device has a namespace and ufile is tied to a ib_device, can we ever
> > have a situation where the ib_device has a different namespace than the
> > ufile's? This would mean we changed the namespace of the ib_device, and
> > IIRC, that means we revoked/disassociated the ufile? So the answer is no?
> > This means #4 and #5 are the same thing.
> >
> Right.
>  
> > Can a uobject affiliated netdev have a different namespace than the
> > ib_device? 
> When a uobject when created, it is not affiliated to netdev.

I'm asking about when it does have a netdev. When you create/modify a
QP and give it a gid index, for instance.

> > The netdevs arise from the gid table, and the gid table population
> > should strictly follow the ib_device namespace, yes?

> I wish it this way, but unfortunately, rdma still have ancient
> shared mode for example single rdma device + macvlan.  Until that is
> deprecated, let the gid table entry's netdev drives the QP modify as
> done today.

I have been ignoring shared mode in all of this analysis. I don't
think you can make sane statements about container security in shared
mode.

> > Can current have a different namespace than the ib_device? I guess yes, the
> > FD can be passed around. However this would mean that the FD caller should
> > not be able to get any gid table handles as none of its ifindexes will work. So
> > #1 is != #3/#4/#5
> 
> Well, it can pass the fd after the ifindex is resolved (i.e. after modify_qp).
> If fd is passed before modify qp in different net ns, its can get access too because rdma device got shared.

That's all fine. The uobject retains its affiliated netdev.

> But that is the case with raw socket too.  The difference is, every
> send() call checks the ifindex, vs here its checked when raw qp is
> created.

Also I think fine

> We can add the additional check in the sysfs and in modify qp, but
> very long ago (2019), we envisioned that users should use only the
> exclusive mode.  And hence, those checks were not added.

I think we should ignore shared mode, it doesn't work sanely with
namespaces.

> > What other NS users are there?

> Incoming rx IB mad packets are looked up in the GID's attached netdev's net ns.

Ultimately a GID index should not be delivered to a userspace that
does not have that GID index in the objects affiliated net namespace.
I wonder if we are missing some validation here

> In-kernel ulps (nfs, smc) do not seem to have the interest, but they
> do not created uobjects nor they access any uverbs fd.

IIRC we have open issues with NFS/SRP/etc and namespaces, the kernel
ULP doesn't have a way to use a namespace?

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-24 14:13                                           ` Jason Gunthorpe
@ 2025-04-25 13:14                                             ` Parav Pandit
  2025-04-25 13:29                                               ` Jason Gunthorpe
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-25 13:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric W. Biederman, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 24, 2025 7:44 PM
> To: Parav Pandit <parav@nvidia.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>; Serge E. Hallyn
> <serge@hallyn.com>; linux-rdma@vger.kernel.org; linux-security-
> module@vger.kernel.org; Leon Romanovsky <leonro@nvidia.com>
> Subject: Re: [PATCH] RDMA/uverbs: Consider capability of the process that
> opens the file
> 
> On Thu, Apr 24, 2025 at 09:08:17AM +0000, Parav Pandit wrote:
> > > Since ib_device has a namespace and ufile is tied to a ib_device,
> > > can we ever have a situation where the ib_device has a different
> > > namespace than the ufile's? This would mean we changed the
> namespace
> > > of the ib_device, and IIRC, that means we revoked/disassociated the ufile?
> So the answer is no?
> > > This means #4 and #5 are the same thing.
> > >
> > Right.
> >
> > > Can a uobject affiliated netdev have a different namespace than the
> > > ib_device?
> > When a uobject when created, it is not affiliated to netdev.
> 
> I'm asking about when it does have a netdev. When you create/modify a QP
> and give it a gid index, for instance.
We don't have a check.
Usually once the rdma device and associated uverbs char device are assigned to a respective net and mount ns.
So to protect against such privileged user error, net ns enforcement would be good to add for exclusive mode.

More below.
> 
> > > The netdevs arise from the gid table, and the gid table population
> > > should strictly follow the ib_device namespace, yes?
> 
> > I wish it this way, but unfortunately, rdma still have ancient shared
> > mode for example single rdma device + macvlan.  Until that is
> > deprecated, let the gid table entry's netdev drives the QP modify as
> > done today.
> 
> I have been ignoring shared mode in all of this analysis. I don't think you can
> make sane statements about container security in shared mode.
> 
True. one option is to check in modify_qp() and other friend callers net ns checks 
with netdev and current->nsproxy->net.
(and not of ibdev).

Or 2nd option (preferred) is: to perform the checks against only in the exclusive mode, 
against the net_ns of the ibdev.
Finding this better, as you also suggest that further below.

> > > Can current have a different namespace than the ib_device? I guess
> > > yes, the FD can be passed around. However this would mean that the
> > > FD caller should not be able to get any gid table handles as none of
> > > its ifindexes will work. So
> > > #1 is != #3/#4/#5
> >
> > Well, it can pass the fd after the ifindex is resolved (i.e. after modify_qp).
> > If fd is passed before modify qp in different net ns, its can get access too
> because rdma device got shared.
> 
> That's all fine. The uobject retains its affiliated netdev.
> 
> > But that is the case with raw socket too.  The difference is, every
> > send() call checks the ifindex, vs here its checked when raw qp is
> > created.
> 
> Also I think fine
> 
> > We can add the additional check in the sysfs and in modify qp, but
> > very long ago (2019), we envisioned that users should use only the
> > exclusive mode.  And hence, those checks were not added.
> 
> I think we should ignore shared mode, it doesn't work sanely with
> namespaces.
> 
> > > What other NS users are there?
> 
> > Incoming rx IB mad packets are looked up in the GID's attached netdev's net
> ns.
> 
> Ultimately a GID index should not be delivered to a userspace that does not
> have that GID index in the objects affiliated net namespace.
> I wonder if we are missing some validation here
> 
I prefer to avoid holding a reference to the net namespace under uobject creation time.
I think its wrong to hold reference to net and user ns in syscall() object creation flow, but cant prove at the moment.

Better to enforce net ns checks when adding the GID.
This is where GID, netdev association occurs.
So if the netns of netdev and ibdev do not match in exclusive mode, we wont even add the GID.

And for some reason if netdev moves out of net ns, we remove such GID entries using usual callback via del_gid.

> > In-kernel ulps (nfs, smc) do not seem to have the interest, but they
> > do not created uobjects nor they access any uverbs fd.
> 
> IIRC we have open issues with NFS/SRP/etc and namespaces, the kernel ULP
> doesn't have a way to use a namespace?
> 
Nfs and srp initiator has net ns awareness. Rest all are in default net ns.

> Jason

Maybe I missed the user ns conclusion in discussing net ns enforcement.

So let me summarize my understanding:

1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
And we don't hold any reference for user ns.
This will be only done for the selected objects who need cap enforcement.
Can we proceed with this for user ns cap enforcement?

2. For net ns protection in exclusive mode, few enforcements to be done for 
ib device modify_qp, sysfs, gid query. This will be a separate, unrelated patch(es) to user ns.

3. Do not enforce things in shared net ns mode.

For #1 and #2, will send two different patch set.

Does this path look ok?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 13:14                                             ` Parav Pandit
@ 2025-04-25 13:29                                               ` Jason Gunthorpe
  2025-04-25 13:54                                                 ` Parav Pandit
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-25 13:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Eric W. Biederman, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:

> 1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
> And we don't hold any reference for user ns.

This is the thing that makes my head ache.. Is that really the right
way to get the user_ns of current? Is it possible that current has
multiple user_ns's? We are picking nsproxy because ib_dev has a net
namespace affiliation?

> This will be only done for the selected objects who need cap enforcement.
> Can we proceed with this for user ns cap enforcement?
> 
> 2. For net ns protection in exclusive mode, few enforcements to be done for 
> ib device modify_qp, sysfs, gid query. This will be a separate, unrelated patch(es) to user ns.
> 
> 3. Do not enforce things in shared net ns mode.
> 
> For #1 and #2, will send two different patch set.
> 
> Does this path look ok?

Yes

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 13:29                                               ` Jason Gunthorpe
@ 2025-04-25 13:54                                                 ` Parav Pandit
  2025-04-25 14:06                                                   ` Serge E. Hallyn
  2025-04-25 13:59                                                 ` Serge E. Hallyn
  2025-04-25 14:01                                                 ` Serge E. Hallyn
  2 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-25 13:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric W. Biederman, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 25, 2025 7:00 PM
> 
> On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> 
> > 1. In uobject creation syscall, I will add the check current->nsproxy->net-
> >user_ns capability using ns_capable().
> > And we don't hold any reference for user ns.
> 
> This is the thing that makes my head ache.. Is that really the right way to get
> the user_ns of current? 

> Is it possible that current has multiple user_ns's? 
I don't think so.

> We
> are picking nsproxy because ib_dev has a net namespace affiliation?
> 
Yes.

After ruling out file's user ns, I believe there are two user ns.

1. current_user_ns() 
2. current->nsproxy->net->user_ns.

In most cases #1 and #2 should be same to my knowledge.

When/if user wants to do have nested user ns, and don't want to create a new net ns, #2 can be of use.
For example,
a. Process1 starts in user_ns_1 which created net_ns_1
b. rdma device is in net_ns_1
c. Process1 unshare and moves to user_ns_2.
d. For some reason user_ns_2 does not have the cap.

By current UTS and other namespace semantics, since rdma device belongs to net ns, net ns's creator user ns to be considered.

I am unsure if doing #1 breaks any existing model.
I like to get Eric/Serge's view also, if we should consider #1 or #2.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 13:29                                               ` Jason Gunthorpe
  2025-04-25 13:54                                                 ` Parav Pandit
@ 2025-04-25 13:59                                                 ` Serge E. Hallyn
  2025-04-25 14:01                                                 ` Serge E. Hallyn
  2 siblings, 0 replies; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-25 13:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Eric W. Biederman, Serge E. Hallyn,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> 
> > 1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
> > And we don't hold any reference for user ns.
> 
> This is the thing that makes my head ache.. Is that really the right
> way to get the user_ns of current? Is it possible that current has
> multiple user_ns's? We are picking nsproxy because ib_dev has a net
> namespace affiliation?

It's making my head ache too, but I think for slightly different reason.
If the device is not going to be tied to a network or user namespace yet,
as I believe Parav is saying, than what on earth are we checking caps for
at all?  With capable(CAP_NET_X) it makes sense since that is system wide.
But if the check is going to be against a namespace, then it seems
meaningless.  AFAICS I can just create a new userns and a new empty netns,
create the resource with the privilege I have, then use that resource from
the init_net_ns/init_user_ns.

Even if there will be subsequent checks when using it, the question remains
what was the point of the check at creation time.

IMO either the net or the user_ns needs to be stashed at creation time,
because that's the thing to which access has been granted.

> > This will be only done for the selected objects who need cap enforcement.
> > Can we proceed with this for user ns cap enforcement?
> > 
> > 2. For net ns protection in exclusive mode, few enforcements to be done for 
> > ib device modify_qp, sysfs, gid query. This will be a separate, unrelated patch(es) to user ns.
> > 
> > 3. Do not enforce things in shared net ns mode.
> > 
> > For #1 and #2, will send two different patch set.
> > 
> > Does this path look ok?
> 
> Yes
> 
> Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 13:29                                               ` Jason Gunthorpe
  2025-04-25 13:54                                                 ` Parav Pandit
  2025-04-25 13:59                                                 ` Serge E. Hallyn
@ 2025-04-25 14:01                                                 ` Serge E. Hallyn
  2025-04-25 14:24                                                   ` Jason Gunthorpe
  2 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-25 14:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Eric W. Biederman, Serge E. Hallyn,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> 
> > 1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
> > And we don't hold any reference for user ns.
> 
> This is the thing that makes my head ache.. Is that really the right
> way to get the user_ns of current? Is it possible that current has
> multiple user_ns's? We are picking nsproxy because ib_dev has a net
> namespace affiliation?

It's not that "current has multiple user_ns's", it's that the various
resources, including other namespaces, which current has or belongs
to have associated namespaces.

current_user_ns() is the user namespace to which current belongs.
But if you want to check if it can have privilege over a resource,
you have to check whether current has ns_capable(resource->userns, CAP_X).

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 13:54                                                 ` Parav Pandit
@ 2025-04-25 14:06                                                   ` Serge E. Hallyn
  2025-04-25 15:05                                                     ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-25 14:06 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Eric W. Biederman, Serge E. Hallyn,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Fri, Apr 25, 2025 at 01:54:07PM +0000, Parav Pandit wrote:
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, April 25, 2025 7:00 PM
> > 
> > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> > 
> > > 1. In uobject creation syscall, I will add the check current->nsproxy->net-
> > >user_ns capability using ns_capable().
> > > And we don't hold any reference for user ns.
> > 
> > This is the thing that makes my head ache.. Is that really the right way to get
> > the user_ns of current? 
> 
> > Is it possible that current has multiple user_ns's? 
> I don't think so.
> 
> > We
> > are picking nsproxy because ib_dev has a net namespace affiliation?
> > 
> Yes.
> 
> After ruling out file's user ns, I believe there are two user ns.
> 
> 1. current_user_ns() 
> 2. current->nsproxy->net->user_ns.
> 
> In most cases #1 and #2 should be same to my knowledge.
> 
> When/if user wants to do have nested user ns, and don't want to create a new net ns, #2 can be of use.
> For example,
> a. Process1 starts in user_ns_1 which created net_ns_1
> b. rdma device is in net_ns_1
> c. Process1 unshare and moves to user_ns_2.
> d. For some reason user_ns_2 does not have the cap.

(d) is important.  "user_ns_2 does not have the cap" is imprecise.  Process1
after the unshare does have the cap against user_ns_2.  It does not have
it against user_ns_1, and since net_ns_1->user_ns == user_ns_1, that
means it loses privilege over net_ns_1.  Which is what we need.  Because
otherwise, an unprivileged user could simply unshare the user_ns, be root
there, and now tweak networking.

This all stems from the original requirements for user namespaces, which
were (off top of my head)

* unprivileged users must be able to create user namespaces
* root in a user namespace must be privileged over its resources
* root in a user namespace must have no privilege over any other resources
* user namespaces must nest

> By current UTS and other namespace semantics, since rdma device belongs to net ns, net ns's creator user ns to be considered.
> 
> I am unsure if doing #1 breaks any existing model.
> I like to get Eric/Serge's view also, if we should consider #1 or #2.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 14:01                                                 ` Serge E. Hallyn
@ 2025-04-25 14:24                                                   ` Jason Gunthorpe
  2025-04-25 15:06                                                     ` Serge E. Hallyn
  2025-04-25 15:32                                                     ` Eric W. Biederman
  0 siblings, 2 replies; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-25 14:24 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Parav Pandit, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Fri, Apr 25, 2025 at 09:01:44AM -0500, Serge E. Hallyn wrote:
> On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> > 
> > > 1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
> > > And we don't hold any reference for user ns.
> > 
> > This is the thing that makes my head ache.. Is that really the right
> > way to get the user_ns of current? Is it possible that current has
> > multiple user_ns's? We are picking nsproxy because ib_dev has a net
> > namespace affiliation?
> 
> It's not that "current has multiple user_ns's", it's that the various
> resources, including other namespaces, which current has or belongs
> to have associated namespaces.

That seems like splitting nits. Can I do current->XXX->user_ns and get
different answers? Sounds like yes?

> current_user_ns() is the user namespace to which current belongs.
> But if you want to check if it can have privilege over a resource,
> you have to check whether current has ns_capable(resource->userns, CAP_X).

So what is the resource here?

It is definitely not the file descriptor.

Is it the kernel's struct ib_device? It has a netns that is captured
at its creation time.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 14:06                                                   ` Serge E. Hallyn
@ 2025-04-25 15:05                                                     ` Parav Pandit
  2025-04-25 15:29                                                       ` Serge E. Hallyn
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-25 15:05 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Jason Gunthorpe, Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky



> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Friday, April 25, 2025 7:37 PM
> 
> On Fri, Apr 25, 2025 at 01:54:07PM +0000, Parav Pandit wrote:
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, April 25, 2025 7:00 PM
> > >
> > > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> > >
> > > > 1. In uobject creation syscall, I will add the check
> > > >current->nsproxy->net- user_ns capability using ns_capable().
> > > > And we don't hold any reference for user ns.
> > >
> > > This is the thing that makes my head ache.. Is that really the right
> > > way to get the user_ns of current?
> >
> > > Is it possible that current has multiple user_ns's?
> > I don't think so.
> >
> > > We
> > > are picking nsproxy because ib_dev has a net namespace affiliation?
> > >
> > Yes.
> >
> > After ruling out file's user ns, I believe there are two user ns.
> >
> > 1. current_user_ns()
> > 2. current->nsproxy->net->user_ns.
> >
> > In most cases #1 and #2 should be same to my knowledge.
> >
> > When/if user wants to do have nested user ns, and don't want to create a
> new net ns, #2 can be of use.
> > For example,
> > a. Process1 starts in user_ns_1 which created net_ns_1 b. rdma device
> > is in net_ns_1 c. Process1 unshare and moves to user_ns_2.
> > d. For some reason user_ns_2 does not have the cap.
> 
> (d) is important.  "user_ns_2 does not have the cap" is imprecise.  Process1
> after the unshare does have the cap against user_ns_2.  It does not have it
> against user_ns_1, and since net_ns_1->user_ns == user_ns_1, that means it
> loses privilege over net_ns_1.  Which is what we need.  Because otherwise, an
> unprivileged user could simply unshare the user_ns, be root there, and now
> tweak networking.
>
Effectively, since process_1 lost the privilege in user_ns_1 after step #c, 
if user wants to tweak networking, it must have a rdma device in net_ns_2, created by the user_ns_2.
Right?
 
> This all stems from the original requirements for user namespaces, which
> were (off top of my head)
> 
> * unprivileged users must be able to create user namespaces
> * root in a user namespace must be privileged over its resources
> * root in a user namespace must have no privilege over any other resources
> * user namespaces must nest
> 
> > By current UTS and other namespace semantics, since rdma device belongs
> to net ns, net ns's creator user ns to be considered.
> >
> > I am unsure if doing #1 breaks any existing model.
> > I like to get Eric/Serge's view also, if we should consider #1 or #2.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 14:24                                                   ` Jason Gunthorpe
@ 2025-04-25 15:06                                                     ` Serge E. Hallyn
  2025-04-25 15:27                                                       ` Parav Pandit
  2025-04-25 15:32                                                     ` Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-25 15:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Fri, Apr 25, 2025 at 11:24:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 25, 2025 at 09:01:44AM -0500, Serge E. Hallyn wrote:
> > On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> > > 
> > > > 1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
> > > > And we don't hold any reference for user ns.
> > > 
> > > This is the thing that makes my head ache.. Is that really the right
> > > way to get the user_ns of current? Is it possible that current has
> > > multiple user_ns's? We are picking nsproxy because ib_dev has a net
> > > namespace affiliation?
> > 
> > It's not that "current has multiple user_ns's", it's that the various
> > resources, including other namespaces, which current has or belongs
> > to have associated namespaces.
> 
> That seems like splitting nits. Can I do current->XXX->user_ns and get
> different answers? Sounds like yes?

I don't think it's splitting nits.  current->nsproxy->net_ns->user_ns
is not current's user namespace.

> > current_user_ns() is the user namespace to which current belongs.
> > But if you want to check if it can have privilege over a resource,
> > you have to check whether current has ns_capable(resource->userns, CAP_X).
> 
> So what is the resource here?

That's what I've been trying to get answered :)

> It is definitely not the file descriptor.
> 
> Is it the kernel's struct ib_device? It has a netns that is captured
> at its creation time.

I think that's what you suggested before, and it sounds like the
right answer to me.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 15:06                                                     ` Serge E. Hallyn
@ 2025-04-25 15:27                                                       ` Parav Pandit
  2025-04-25 15:46                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-25 15:27 UTC (permalink / raw)
  To: Serge E. Hallyn, Jason Gunthorpe
  Cc: Eric W. Biederman, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky



> From: Serge E. Hallyn <serge@hallyn.com>
> Sent: Friday, April 25, 2025 8:37 PM
> 
> On Fri, Apr 25, 2025 at 11:24:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 25, 2025 at 09:01:44AM -0500, Serge E. Hallyn wrote:
> > > On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> > > >
> > > > > 1. In uobject creation syscall, I will add the check current->nsproxy-
> >net->user_ns capability using ns_capable().
> > > > > And we don't hold any reference for user ns.
> > > >
> > > > This is the thing that makes my head ache.. Is that really the
> > > > right way to get the user_ns of current? Is it possible that
> > > > current has multiple user_ns's? We are picking nsproxy because
> > > > ib_dev has a net namespace affiliation?
> > >
> > > It's not that "current has multiple user_ns's", it's that the
> > > various resources, including other namespaces, which current has or
> > > belongs to have associated namespaces.
> >
> > That seems like splitting nits. Can I do current->XXX->user_ns and get
> > different answers? Sounds like yes?
> 
> I don't think it's splitting nits.  current->nsproxy->net_ns->user_ns is not
> current's user namespace.
> 
> > > current_user_ns() is the user namespace to which current belongs.
> > > But if you want to check if it can have privilege over a resource,
> > > you have to check whether current has ns_capable(resource->userns,
> CAP_X).
> >
> > So what is the resource here?
> 
> That's what I've been trying to get answered :)
>
A. When a raw socket is created to send a packet vis netdev (resource), the cap is checked against the process in [1].

[1] https://elixir.bootlin.com/linux/v6.14.3/source/net/ipv4/af_inet.c#L314
Not against the netdev's dev_net().

B. Netdev's dev_net() is crossed checked to see if current process can access netdev ifindex or not in send() call.

So resource was netdev, but literally the check is against the process cap.

And considering above is right,

I try to draw the parallels between the two types of network devices,

the check for rdma raw QP (equivalent of raw socket), 
Should check similarly against the process's net->user_ns during QP creation time.
This will match #A.

And process to rdma access check should be a separate check #B.
 
> > It is definitely not the file descriptor.
> >
> > Is it the kernel's struct ib_device? It has a netns that is captured
> > at its creation time.
> 
> I think that's what you suggested before, and it sounds like the right answer to
> me.
This also works in my view. Its slight deviation to how net stack does. 
and trying to find a good reason of why should it be net of rdma device?
Is it because socket is sw resource vs rdma is device resource?
Couldn't convince myself given that the capability is of the process and not the device.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 15:05                                                     ` Parav Pandit
@ 2025-04-25 15:29                                                       ` Serge E. Hallyn
  0 siblings, 0 replies; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-25 15:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Jason Gunthorpe, Eric W. Biederman,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Fri, Apr 25, 2025 at 03:05:06PM +0000, Parav Pandit wrote:
> 
> 
> > From: Serge E. Hallyn <serge@hallyn.com>
> > Sent: Friday, April 25, 2025 7:37 PM
> > 
> > On Fri, Apr 25, 2025 at 01:54:07PM +0000, Parav Pandit wrote:
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, April 25, 2025 7:00 PM
> > > >
> > > > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> > > >
> > > > > 1. In uobject creation syscall, I will add the check
> > > > >current->nsproxy->net- user_ns capability using ns_capable().
> > > > > And we don't hold any reference for user ns.
> > > >
> > > > This is the thing that makes my head ache.. Is that really the right
> > > > way to get the user_ns of current?
> > >
> > > > Is it possible that current has multiple user_ns's?
> > > I don't think so.
> > >
> > > > We
> > > > are picking nsproxy because ib_dev has a net namespace affiliation?
> > > >
> > > Yes.
> > >
> > > After ruling out file's user ns, I believe there are two user ns.
> > >
> > > 1. current_user_ns()
> > > 2. current->nsproxy->net->user_ns.
> > >
> > > In most cases #1 and #2 should be same to my knowledge.
> > >
> > > When/if user wants to do have nested user ns, and don't want to create a
> > new net ns, #2 can be of use.
> > > For example,
> > > a. Process1 starts in user_ns_1 which created net_ns_1 b. rdma device
> > > is in net_ns_1 c. Process1 unshare and moves to user_ns_2.
> > > d. For some reason user_ns_2 does not have the cap.
> > 
> > (d) is important.  "user_ns_2 does not have the cap" is imprecise.  Process1
> > after the unshare does have the cap against user_ns_2.  It does not have it
> > against user_ns_1, and since net_ns_1->user_ns == user_ns_1, that means it
> > loses privilege over net_ns_1.  Which is what we need.  Because otherwise, an
> > unprivileged user could simply unshare the user_ns, be root there, and now
> > tweak networking.
> >
> Effectively, since process_1 lost the privilege in user_ns_1 after step #c, 
> if user wants to tweak networking, it must have a rdma device in net_ns_2, created by the user_ns_2.
> Right?

Right.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 14:24                                                   ` Jason Gunthorpe
  2025-04-25 15:06                                                     ` Serge E. Hallyn
@ 2025-04-25 15:32                                                     ` Eric W. Biederman
  2025-04-25 16:21                                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-25 15:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Fri, Apr 25, 2025 at 09:01:44AM -0500, Serge E. Hallyn wrote:
>> On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
>> > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
>> > 
>> > > 1. In uobject creation syscall, I will add the check current->nsproxy->net->user_ns capability using ns_capable().
>> > > And we don't hold any reference for user ns.
>> > 
>> > This is the thing that makes my head ache.. Is that really the right
>> > way to get the user_ns of current? Is it possible that current has
>> > multiple user_ns's? We are picking nsproxy because ib_dev has a net
>> > namespace affiliation?
>> 
>> It's not that "current has multiple user_ns's", it's that the various
>> resources, including other namespaces, which current has or belongs
>> to have associated namespaces.
>
> That seems like splitting nits. Can I do current->XXX->user_ns and get
> different answers? Sounds like yes?

Totally.

current->cred->user_ns (aka current_user_ns) is the what the process
has.


Everything else is what some resource XXX has as it's user_ns.  We could
have made the resources have a uid and gid or even a full struct cred
for their access permissions but when all you want are capability checks
anything more than a struct user_ns pointer is overkill.

But it isn't nits.  It is like files:
AKA current->mnt_ns->mnt_root->d_inode->i_uid doesn't have to
be related to current->cred->uid either.

The reason current shows up so much is that there is a principle of
relativity involved where things that show up are relative to which
process is looking.  Because different processes have different
namespaces.

The relativity and everything else predates user namespaces.  User
namespaces just drags credentials into it all.

>> current_user_ns() is the user namespace to which current belongs.
>> But if you want to check if it can have privilege over a resource,
>> you have to check whether current has ns_capable(resource->userns, CAP_X).
>
> So what is the resource here?
>
> It is definitely not the file descriptor.
>
> Is it the kernel's struct ib_device? It has a netns that is captured
> at its creation time.

Yes.  Very much so.

To relax the permissions for access the change would need to be:
capable(CAP_NET_XXX)  -> ns_capable(ib_device->net->user_ns, CAP_NET_XXX)

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 15:27                                                       ` Parav Pandit
@ 2025-04-25 15:46                                                         ` Eric W. Biederman
  2025-04-25 16:16                                                           ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-25 15:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Serge E. Hallyn, Jason Gunthorpe, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Parav Pandit <parav@nvidia.com> writes:

>> From: Serge E. Hallyn <serge@hallyn.com>
>> Sent: Friday, April 25, 2025 8:37 PM
>> 
>> On Fri, Apr 25, 2025 at 11:24:29AM -0300, Jason Gunthorpe wrote:
>> > On Fri, Apr 25, 2025 at 09:01:44AM -0500, Serge E. Hallyn wrote:
>> > > On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
>> > > > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
>> > > >
>> > > > > 1. In uobject creation syscall, I will add the check current->nsproxy-
>> >net->user_ns capability using ns_capable().
>> > > > > And we don't hold any reference for user ns.
>> > > >
>> > > > This is the thing that makes my head ache.. Is that really the
>> > > > right way to get the user_ns of current? Is it possible that
>> > > > current has multiple user_ns's? We are picking nsproxy because
>> > > > ib_dev has a net namespace affiliation?
>> > >
>> > > It's not that "current has multiple user_ns's", it's that the
>> > > various resources, including other namespaces, which current has or
>> > > belongs to have associated namespaces.
>> >
>> > That seems like splitting nits. Can I do current->XXX->user_ns and get
>> > different answers? Sounds like yes?
>> 
>> I don't think it's splitting nits.  current->nsproxy->net_ns->user_ns is not
>> current's user namespace.
>> 
>> > > current_user_ns() is the user namespace to which current belongs.
>> > > But if you want to check if it can have privilege over a resource,
>> > > you have to check whether current has ns_capable(resource->userns,
>> CAP_X).
>> >
>> > So what is the resource here?
>> 
>> That's what I've been trying to get answered :)
>>
> A. When a raw socket is created to send a packet vis netdev
> (resource), the cap is checked against the process in [1].

No.

> [1] https://elixir.bootlin.com/linux/v6.14.3/source/net/ipv4/af_inet.c#L314
> Not against the netdev's dev_net().

There isn't a netdevide to check against.

The resource is the network namespace aka net.

The permission check is not solely against the processes credentials
in current->cred.

> B. Netdev's dev_net() is crossed checked to see if current process can access netdev ifindex or not in send() call.
>
> So resource was netdev, but literally the check is against the process
> cap.

Not at all.    The process (user_ns,cap) pair in current->cred is checked
against the user_ns that owns the network namespace aka net->user_ns.

That current is used to find the network namespace is irrelevant for
the permission check.

> And considering above is right,
>
> I try to draw the parallels between the two types of network devices,
>
> the check for rdma raw QP (equivalent of raw socket), 
> Should check similarly against the process's net->user_ns during QP creation time.
> This will match #A.
>
> And process to rdma access check should be a separate check #B.

The way you described it sounds wrong, but your conclusion of
what needs to be checked seems correct.

Eric



^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 15:46                                                         ` Eric W. Biederman
@ 2025-04-25 16:16                                                           ` Parav Pandit
  0 siblings, 0 replies; 58+ messages in thread
From: Parav Pandit @ 2025-04-25 16:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Jason Gunthorpe, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Eric W. Biederman <ebiederm@xmission.com>
> Sent: Friday, April 25, 2025 9:17 PM
> 
> Parav Pandit <parav@nvidia.com> writes:
> 
> >> From: Serge E. Hallyn <serge@hallyn.com>
> >> Sent: Friday, April 25, 2025 8:37 PM
> >>
> >> On Fri, Apr 25, 2025 at 11:24:29AM -0300, Jason Gunthorpe wrote:
> >> > On Fri, Apr 25, 2025 at 09:01:44AM -0500, Serge E. Hallyn wrote:
> >> > > On Fri, Apr 25, 2025 at 10:29:30AM -0300, Jason Gunthorpe wrote:
> >> > > > On Fri, Apr 25, 2025 at 01:14:35PM +0000, Parav Pandit wrote:
> >> > > >
> >> > > > > 1. In uobject creation syscall, I will add the check
> >> > > > > current->nsproxy-
> >> >net->user_ns capability using ns_capable().
> >> > > > > And we don't hold any reference for user ns.
> >> > > >
> >> > > > This is the thing that makes my head ache.. Is that really the
> >> > > > right way to get the user_ns of current? Is it possible that
> >> > > > current has multiple user_ns's? We are picking nsproxy because
> >> > > > ib_dev has a net namespace affiliation?
> >> > >
> >> > > It's not that "current has multiple user_ns's", it's that the
> >> > > various resources, including other namespaces, which current has
> >> > > or belongs to have associated namespaces.
> >> >
> >> > That seems like splitting nits. Can I do current->XXX->user_ns and
> >> > get different answers? Sounds like yes?
> >>
> >> I don't think it's splitting nits.  current->nsproxy->net_ns->user_ns
> >> is not current's user namespace.
> >>
> >> > > current_user_ns() is the user namespace to which current belongs.
> >> > > But if you want to check if it can have privilege over a
> >> > > resource, you have to check whether current has
> >> > > ns_capable(resource->userns,
> >> CAP_X).
> >> >
> >> > So what is the resource here?
> >>
> >> That's what I've been trying to get answered :)
> >>
> > A. When a raw socket is created to send a packet vis netdev
> > (resource), the cap is checked against the process in [1].
> 
> No.
> 
> > [1]
> > https://elixir.bootlin.com/linux/v6.14.3/source/net/ipv4/af_inet.c#L31
> > 4
> > Not against the netdev's dev_net().
> 
> There isn't a netdevide to check against.
> 

> The resource is the network namespace aka net.
>
Don't find any difference for rdma either.
For rdma too network namespace is the resource.
 
> The permission check is not solely against the processes credentials in current-
> >cred.
> 
> > B. Netdev's dev_net() is crossed checked to see if current process can access
> netdev ifindex or not in send() call.
> >
> > So resource was netdev, but literally the check is against the process
> > cap.
> 
> Not at all.    The process (user_ns,cap) pair in current->cred is checked
> against the user_ns that owns the network namespace aka net->user_ns.
> 
> That current is used to find the network namespace is irrelevant for the
> permission check.
> 
> > And considering above is right,
> >
> > I try to draw the parallels between the two types of network devices,
> >
> > the check for rdma raw QP (equivalent of raw socket), Should check
> > similarly against the process's net->user_ns during QP creation time.
> > This will match #A.
> >
> > And process to rdma access check should be a separate check #B.
> 
> The way you described it sounds wrong, but your conclusion of what needs to
> be checked seems correct.
> 
> Eric
> 
Since you described that net namespace is the resource, 
current->nsproxy->net->user_ns seems correct for rdma case too.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 15:32                                                     ` Eric W. Biederman
@ 2025-04-25 16:21                                                       ` Jason Gunthorpe
  2025-04-25 17:34                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-25 16:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Fri, Apr 25, 2025 at 10:32:27AM -0500, Eric W. Biederman wrote:
> > That seems like splitting nits. Can I do current->XXX->user_ns and get
> > different answers? Sounds like yes?
> 
> Totally.
> 
> current->cred->user_ns (aka current_user_ns) is the what the process
> has.

Well, this is the head hurty bit. "cred->user_ns" is not what the
process "has" if the kernel is checking resource->netns->user_ns for
the capability checks and ignores cred->user_ns?

How does a userspace process actually know what its current
capabilties are? Like how does it tell if CAP_NET_XX is actually
available?

What about something like CAP_SYS_RAWIO? I don't think we would ever
make that a per-userns thing, but as a thought experiment, do we check
current->XXX->user_ns or still check ibdev->netns->XX->user_ns?

> > Is it the kernel's struct ib_device? It has a netns that is captured
> > at its creation time.
> 
> Yes.  Very much so.

Okay.. And looking at this more we actually check that the process
that opens /dev/../uverbsX has the same net_ns as the ib_device:

static int ib_uverbs_open(struct inode *inode, struct file *filp)
{
	if (!rdma_dev_access_netns(ib_dev, current->nsproxy->net_ns)) {
		ret = -EPERM;

bool rdma_dev_access_netns(const struct ib_device *dev, const struct net *net)
{
	return (ib_devices_shared_netns ||
		net_eq(read_pnet(&dev->coredev.rdma_net), net));

So you can say we 'captured' the net_ns into the FD as there is some
struct file->....->ib_dev->..->net_ns that does not change

Thus ib_dev->...->user_ns is going to always be the user_ns of the
netns of the process that opened the FD.

So.. hopefully final question.. When we are in a system call context
and want to check CAP_NET_XX should we also require that the current
process has the same net ns as the ib_dev?

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 16:21                                                       ` Jason Gunthorpe
@ 2025-04-25 17:34                                                         ` Eric W. Biederman
  2025-04-25 18:20                                                           ` Parav Pandit
  2025-04-25 18:35                                                           ` Jason Gunthorpe
  0 siblings, 2 replies; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-25 17:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Fri, Apr 25, 2025 at 10:32:27AM -0500, Eric W. Biederman wrote:
>> > That seems like splitting nits. Can I do current->XXX->user_ns and get
>> > different answers? Sounds like yes?
>> 
>> Totally.
>> 
>> current->cred->user_ns (aka current_user_ns) is the what the process
>> has.
>
> Well, this is the head hurty bit. "cred->user_ns" is not what the
> process "has" if the kernel is checking resource->netns->user_ns for
> the capability checks and ignores cred->user_ns?

resource->net->user_ns CAP_XXXX is what the process needs.

current->cred->user_ns and current->cred->cap_effective
are what the process has.

> How does a userspace process actually know what its current
> capabilties are? Like how does it tell if CAP_NET_XX is actually
> available?

It looks in current->cred.  In current->cred it finds bits
set in cap_effective, and a user_ns.

Ultimately all capable calls make their way down into
cap_capable_helper.  The cap_capable_helper checks to see if the user
namespace that is wanted (aka from the resource) matches the cred's user
namespace.  If the namespaces match then the bits in cap_effective
are checked.

There are a few more checks that make nested user namespaces work.

> What about something like CAP_SYS_RAWIO? I don't think we would ever
> make that a per-userns thing, but as a thought experiment, do we check
> current->XXX->user_ns or still check ibdev->netns->XX->user_ns?
>

Oh.  CAP_SYS_RAWIO is totally is something you can have.  In fact
the first process in a user namespace starts out with CAP_SYS_RAWIO.
That said it is CAP_SYS_RAWIO with respect to the user namespace.

What would be almost certainly be a bug is for any permission check
to be relaxed to ns_capable(resource->user_ns, CAP_SYS_RAWIO).

>> > Is it the kernel's struct ib_device? It has a netns that is captured
>> > at its creation time.
>> 
>> Yes.  Very much so.
>
> Okay.. And looking at this more we actually check that the process
> that opens /dev/../uverbsX has the same net_ns as the ib_device:

At which point I see how different the ib_device model is from
everything else..

ib_dev only sometimes belongs to a network namespace
(ib_devices_shared_netns).  The rest of time they are independent of the
network namespaces.

I don't know what an infiniband character device refers to.  Is it an
attachment of a physical cable to the box like a netdevice?  Is it an
infiniband queue-pair?

Given that the character device names aren't properly attached to a
network namespace it seems reasonable to ensure that you can only
open infiniband devices that are in your network-namespace (when
that is enabled).

The names (device major and minor) not living in a network namespace
mean that there can be problems for CRIU to migrate a infiniband device,
as it's device major and minor number are not guaranteed to be
available.  Perhaps that doesn't matter, as the name you open is on a
filesystem.  *Shrug*

> static int ib_uverbs_open(struct inode *inode, struct file *filp)
> {
> 	if (!rdma_dev_access_netns(ib_dev, current->nsproxy->net_ns)) {
> 		ret = -EPERM;
>

> bool rdma_dev_access_netns(const struct ib_device *dev, const struct net *net)
> {
> 	return (ib_devices_shared_netns ||
> 		net_eq(read_pnet(&dev->coredev.rdma_net), net));
>
> So you can say we 'captured' the net_ns into the FD as there is some
> struct file->....->ib_dev->..->net_ns that does not change
>
> Thus ib_dev->...->user_ns is going to always be the user_ns of the
> netns of the process that opened the FD.

Nope.

There is no check against current->cred->user_ns.  So the check has
nothing to do with the credentials of the process that opened the
character device.

> So.. hopefully final question.. When we are in a system call context
> and want to check CAP_NET_XX should we also require that the current
> process has the same net ns as the ib_dev?

I want to say in general only for opening the ib_device.

I don't know what to say for the case where ib_devices_shared_netns is
true.  In that case the ib_device doesn't have a network namespace at
all, so at best it would appear to be a nonsense check.

I think you need to restrict the relaxation to the case where
ib_devices_shared_netns is false.

The network stack in general uses netlink to talk to network devices
(sockets are another matter), so this whole using character devices
to talk to devices is very weird to me.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 17:34                                                         ` Eric W. Biederman
@ 2025-04-25 18:20                                                           ` Parav Pandit
  2025-04-25 18:35                                                           ` Jason Gunthorpe
  1 sibling, 0 replies; 58+ messages in thread
From: Parav Pandit @ 2025-04-25 18:20 UTC (permalink / raw)
  To: Eric W. Biederman, Jason Gunthorpe
  Cc: Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Eric W. Biederman <ebiederm@xmission.com>
> Sent: Friday, April 25, 2025 11:04 PM

> I think you need to restrict the relaxation to the case where
> ib_devices_shared_netns is false.
>
Yep, this is what we agreed in [1].

[1] https://lore.kernel.org/linux-rdma/875xisf8ma.fsf@email.froward.int.ebiederm.org/T/#m70e2b41c70f69d38bb74b7ae4abda65da1fc5c69

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 17:34                                                         ` Eric W. Biederman
  2025-04-25 18:20                                                           ` Parav Pandit
@ 2025-04-25 18:35                                                           ` Jason Gunthorpe
  2025-04-27 14:30                                                             ` Serge E. Hallyn
  2025-04-28 17:03                                                             ` Eric W. Biederman
  1 sibling, 2 replies; 58+ messages in thread
From: Jason Gunthorpe @ 2025-04-25 18:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

On Fri, Apr 25, 2025 at 12:34:21PM -0500, Eric W. Biederman wrote:
> > What about something like CAP_SYS_RAWIO? I don't think we would ever
> > make that a per-userns thing, but as a thought experiment, do we check
> > current->XXX->user_ns or still check ibdev->netns->XX->user_ns?
> >
> 
> Oh.  CAP_SYS_RAWIO is totally is something you can have.  In fact
> the first process in a user namespace starts out with CAP_SYS_RAWIO.
> That said it is CAP_SYS_RAWIO with respect to the user namespace.
> 
> What would be almost certainly be a bug is for any permission check
> to be relaxed to ns_capable(resource->user_ns, CAP_SYS_RAWIO).

So a process "has" it but the kernel never accepts it?

> I don't know what an infiniband character device refers to.  Is it an
> attachment of a physical cable to the box like a netdevice?  Is it an
> infiniband queue-pair?

It refers to a single struct ib_device in the kernel. It is kind of a
like a namespace in that all the commands executed and uobjects
created on the FD are relative to the struct ib_device.

> The names (device major and minor) not living in a network namespace
> mean that there can be problems for CRIU to migrate a infiniband device,
> as it's device major and minor number are not guaranteed to be
> available.  Perhaps that doesn't matter, as the name you open is on a
> filesystem.  *Shrug*

I don't see a path for CRIU and rdma, there is too much hardware
state.. Presumably if anyone ever did it they'd have to ignore that
the major/minor changes.

> > static int ib_uverbs_open(struct inode *inode, struct file *filp)
> > {
> > 	if (!rdma_dev_access_netns(ib_dev, current->nsproxy->net_ns)) {
> > 		ret = -EPERM;
> >
> 
> > bool rdma_dev_access_netns(const struct ib_device *dev, const struct net *net)
> > {
> > 	return (ib_devices_shared_netns ||
> > 		net_eq(read_pnet(&dev->coredev.rdma_net), net));
> >
> > So you can say we 'captured' the net_ns into the FD as there is some
> > struct file->....->ib_dev->..->net_ns that does not change
> >
> > Thus ib_dev->...->user_ns is going to always be the user_ns of the
> > netns of the process that opened the FD.
> 
> Nope.
> 
> There is no check against current->cred->user_ns.  So the check has
> nothing to do with the credentials of the process that opened the
> character device.

I said "user_ns of the netns"?  Credentials of the process is something
else?

It sounds like we just totally ignore current->cred->user_ns from the
rdma subsystem perspective?

> > So.. hopefully final question.. When we are in a system call context
> > and want to check CAP_NET_XX should we also require that the current
> > process has the same net ns as the ib_dev?
> 
> I want to say in general only for opening the ib_device.
> 
> I don't know what to say for the case where ib_devices_shared_netns is
> true. In that case the ib_device doesn't have a network namespace at
> all, so at best it would appear to be a nonsense check.

In shared mode it has no namespace containment at all, presumably any
capable checks should continue to be done on the init_net?

> I think you need to restrict the relaxation to the case where
> ib_devices_shared_netns is false.

I think we will never check anything other than init_net in shared
mode.

> The network stack in general uses netlink to talk to network devices
> (sockets are another matter), so this whole using character devices
> to talk to devices is very weird to me.

It isn't that different. In netlink you get the FD through socket, in
char dev you get it through open.

In netdev you can't "open" eth1 but in most other subsystem you can
get a FD that encapsulates a physical device.

Jason

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 18:35                                                           ` Jason Gunthorpe
@ 2025-04-27 14:30                                                             ` Serge E. Hallyn
  2025-04-28 17:03                                                             ` Eric W. Biederman
  1 sibling, 0 replies; 58+ messages in thread
From: Serge E. Hallyn @ 2025-04-27 14:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric W. Biederman, Serge E. Hallyn, Parav Pandit,
	linux-rdma@vger.kernel.org, linux-security-module@vger.kernel.org,
	Leon Romanovsky

On Fri, Apr 25, 2025 at 03:35:29PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 25, 2025 at 12:34:21PM -0500, Eric W. Biederman wrote:
> > > What about something like CAP_SYS_RAWIO? I don't think we would ever
> > > make that a per-userns thing, but as a thought experiment, do we check
> > > current->XXX->user_ns or still check ibdev->netns->XX->user_ns?
> > >
> > 
> > Oh.  CAP_SYS_RAWIO is totally is something you can have.  In fact
> > the first process in a user namespace starts out with CAP_SYS_RAWIO.
> > That said it is CAP_SYS_RAWIO with respect to the user namespace.
> > 
> > What would be almost certainly be a bug is for any permission check
> > to be relaxed to ns_capable(resource->user_ns, CAP_SYS_RAWIO).
> 
> So a process "has" it but the kernel never accepts it?

Capabilities are targeted at some resource.  Sometimes the resource is
global, or always belongs to the initial user namespace.  In the case
of rawio, if ever "device namespaces" became acceptable, then it could
in fact become namespaced for some resources.

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-25 18:35                                                           ` Jason Gunthorpe
  2025-04-27 14:30                                                             ` Serge E. Hallyn
@ 2025-04-28 17:03                                                             ` Eric W. Biederman
  2025-04-29  3:56                                                               ` Eric W. Biederman
  2025-04-29 10:39                                                               ` Parav Pandit
  1 sibling, 2 replies; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-28 17:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Fri, Apr 25, 2025 at 12:34:21PM -0500, Eric W. Biederman wrote:
>> > What about something like CAP_SYS_RAWIO? I don't think we would ever
>> > make that a per-userns thing, but as a thought experiment, do we check
>> > current->XXX->user_ns or still check ibdev->netns->XX->user_ns?
>> >
>> 
>> Oh.  CAP_SYS_RAWIO is totally is something you can have.  In fact
>> the first process in a user namespace starts out with CAP_SYS_RAWIO.
>> That said it is CAP_SYS_RAWIO with respect to the user namespace.
>> 
>> What would be almost certainly be a bug is for any permission check
>> to be relaxed to ns_capable(resource->user_ns, CAP_SYS_RAWIO).
>
> So a process "has" it but the kernel never accepts it?

Exactly.  Most capabilities possessed by root in a user namespace are
never allowed to do anything or at very rarely allowed to do anything.

Semantically the only things root in a user namespace is allowed to
do are those things that the we only disallow because it would confuse
setuid programs, but are otherwise perfectly fine for an unprivileged
process to perform.

>> I don't know what an infiniband character device refers to.  Is it an
>> attachment of a physical cable to the box like a netdevice?  Is it an
>> infiniband queue-pair?
>
> It refers to a single struct ib_device in the kernel. It is kind of a
> like a namespace in that all the commands executed and uobjects
> created on the FD are relative to the struct ib_device.

Unfortunately I am not familiar with the infiniband kernel abstractions.

>> The names (device major and minor) not living in a network namespace
>> mean that there can be problems for CRIU to migrate a infiniband device,
>> as it's device major and minor number are not guaranteed to be
>> available.  Perhaps that doesn't matter, as the name you open is on a
>> filesystem.  *Shrug*
>
> I don't see a path for CRIU and rdma, there is too much hardware
> state.. Presumably if anyone ever did it they'd have to ignore that
> the major/minor changes.

At this point I would be surprised if there was sufficient resources
being put at the problem.  In principle the hardware state is a
simplified version of a TCP/IP connection and some DMA transactions,
so it isn't an unreasonable thing to imagine.

Especially as it is usually the folks who have long running jobs on
clusters (because no single machine is big enough) that are both use
infiniband and are interested in checkpoint-restart.

The CRIU question is always relevant to ask in a namespace context
as that is very much part of the goal of namespaces.  To abstract
out the names so that migration of resources from one machine
can be implemented in the future.

>> > static int ib_uverbs_open(struct inode *inode, struct file *filp)
>> > {
>> > 	if (!rdma_dev_access_netns(ib_dev, current->nsproxy->net_ns)) {
>> > 		ret = -EPERM;
>> >
>> 
>> > bool rdma_dev_access_netns(const struct ib_device *dev, const struct net *net)
>> > {
>> > 	return (ib_devices_shared_netns ||
>> > 		net_eq(read_pnet(&dev->coredev.rdma_net), net));
>> >
>> > So you can say we 'captured' the net_ns into the FD as there is some
>> > struct file->....->ib_dev->..->net_ns that does not change
>> >
>> > Thus ib_dev->...->user_ns is going to always be the user_ns of the
>> > netns of the process that opened the FD.
>> 
>> Nope.
>> 
>> There is no check against current->cred->user_ns.  So the check has
>> nothing to do with the credentials of the process that opened the
>> character device.
>
> I said "user_ns of the netns"?  Credentials of the process is something
> else?

Exactly the credentials of the a process are not:
	current->nsproxy->net_ns->user_ns;  /* Not this */

The credentials of a process are:
	current->cred;  /* This */

With current->cred->user_ns the current processes user namespace.

> It sounds like we just totally ignore current->cred->user_ns from the
> rdma subsystem perspective?

Since you don't allow anything currently to happen in a user namespace
that is completely reasonable.

Once ns_capable checks start being added that changes.

>> > So.. hopefully final question.. When we are in a system call context
>> > and want to check CAP_NET_XX should we also require that the current
>> > process has the same net ns as the ib_dev?
>> 
>> I want to say in general only for opening the ib_device.

The information I don't have to give a better answer is what
an ib_device actually is from a hardware perspective.  That
is why I was asking about queue-pairs above.

>> The network stack in general uses netlink to talk to network devices
>> (sockets are another matter), so this whole using character devices
>> to talk to devices is very weird to me.
>
> It isn't that different. In netlink you get the FD through socket, in
> char dev you get it through open.

It is just enough different that there are no good examples I can point
you at to say what needs to happen here.

For netlink it is totally clear that all of the permission checks need
to happen at open time.  Because the sender of the netlink message
may be on another machine.

For a FD opened through a character device everything is local and
you are still in the context of the requesting process so permission
checks can still happen, and aren't completely silly.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-28 17:03                                                             ` Eric W. Biederman
@ 2025-04-29  3:56                                                               ` Eric W. Biederman
  2025-04-29 10:39                                                               ` Parav Pandit
  1 sibling, 0 replies; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-29  3:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serge E. Hallyn, Parav Pandit, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Jason Gunthorpe <jgg@nvidia.com> writes:
>

>> It sounds like we just totally ignore current->cred->user_ns from the
>> rdma subsystem perspective?
>
> Since you don't allow anything currently to happen in a user namespace
> that is completely reasonable.
>
> Once ns_capable checks start being added that changes.

My apologies I misspoke.

Where infiniband currently uses current->cred->user_ns is in calls to
"capable()".

That will continue if those calls are relaxed to "ns_capable()".

All of which makes sense fundamentally because the only place it
really makes sense to look at the credentials of a process is in
the permission checks.

Eric



^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-28 17:03                                                             ` Eric W. Biederman
  2025-04-29  3:56                                                               ` Eric W. Biederman
@ 2025-04-29 10:39                                                               ` Parav Pandit
  2025-04-30  3:34                                                                 ` Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Parav Pandit @ 2025-04-29 10:39 UTC (permalink / raw)
  To: Eric W. Biederman, Jason Gunthorpe
  Cc: Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Eric W. Biederman <ebiederm@xmission.com>
> Sent: Monday, April 28, 2025 10:34 PM

[..]
> > I said "user_ns of the netns"?  Credentials of the process is
> > something else?
> 
> Exactly the credentials of the a process are not:
> 	current->nsproxy->net_ns->user_ns;  /* Not this */
> 
> The credentials of a process are:
> 	current->cred;  /* This */
> 
> With current->cred->user_ns the current processes user namespace.
> 
I am confused with your above response.
In response [1], you described that net ns is the resource,
hence resource's user namespace is considered.
And your response [1] also aligns to existing code of [2] and many similar conversions done by your commit 276996fda0f33.

[1] https://lore.kernel.org/linux-rdma/87ikmnd3j6.fsf@email.froward.int.ebiederm.org/T/#me5983d8248de0ff9670644c57d71009debaedd6f
[2] https://elixir.bootlin.com/linux/v6.14.3/source/net/ipv4/af_inet.c#L314

So in infiniband, when I replace existing capable() with ns_capable(), 
shouldn't I use current->nsproxy->net_ns->user_ns following [1] and [2], because for infiniband too, the resource is net namespace.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-29 10:39                                                               ` Parav Pandit
@ 2025-04-30  3:34                                                                 ` Eric W. Biederman
  2025-04-30 12:14                                                                   ` Parav Pandit
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2025-04-30  3:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky

Parav Pandit <parav@nvidia.com> writes:

>> From: Eric W. Biederman <ebiederm@xmission.com>
>> Sent: Monday, April 28, 2025 10:34 PM
>
> [..]
>> > I said "user_ns of the netns"?  Credentials of the process is
>> > something else?
>> 
>> Exactly the credentials of the a process are not:
>> 	current->nsproxy->net_ns->user_ns;  /* Not this */
>> 
>> The credentials of a process are:
>> 	current->cred;  /* This */
>> 
>> With current->cred->user_ns the current processes user namespace.
>> 
> I am confused with your above response.
> In response [1], you described that net ns is the resource,
> hence resource's user namespace is considered.
> And your response [1] also aligns to existing code of [2] and many similar conversions done by your commit 276996fda0f33.
>
> [1] https://lore.kernel.org/linux-rdma/87ikmnd3j6.fsf@email.froward.int.ebiederm.org/T/#me5983d8248de0ff9670644c57d71009debaedd6f
> [2] https://elixir.bootlin.com/linux/v6.14.3/source/net/ipv4/af_inet.c#L314
>
> So in infiniband, when I replace existing capable() with ns_capable(), 
> shouldn't I use current->nsproxy->net_ns->user_ns following [1] and
> [2], because for infiniband too, the resource is net namespace.

Almost.

It is true that current->nsproxy->net_ns matches ib_device->net_ns at
open time, but those permission checks don't happen at open time.

After open time you want ib_device->net_ns.  Not
current->nsproxy->net_ns.

At which point your ns_capable call will look something like:

	ns_capable(ib_device->net_ns->user_ns, CAP_NET_RAW);

That ns_capable call will then check

ib_device->net_ns->user_ns against
current->cred->user_ns.

And it will verify that CAP_NET_RAW is in
current->cred->cap_effect.

Thus checking the resource (the ib_device) against the current
process's credentials.

----

The danger of using current->nsproxy->net_ns->user ns after
open time is the caller may have done.

unshare(CLONE_NEWUSER);
unshare(CLONE_NEWNET);

At which point
"ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)"
is guaranteed to be true.

But it isn't meaningful because there are be no ib_devices in that
network namespace.

----

Because of the shared device stuff a relaxed permission check
would actually need to look more like.

	struct user_ns *user_ns = shared ? &init_user_ns : ib_device->net_ns->user_ns;
        ns_capable(user_ns, CAP_NET_RAW);

This allows sharing the capable call for better maintenance but only
relaxing the permission check for the other cases.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH] RDMA/uverbs: Consider capability of the process that opens the file
  2025-04-30  3:34                                                                 ` Eric W. Biederman
@ 2025-04-30 12:14                                                                   ` Parav Pandit
  0 siblings, 0 replies; 58+ messages in thread
From: Parav Pandit @ 2025-04-30 12:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jason Gunthorpe, Serge E. Hallyn, linux-rdma@vger.kernel.org,
	linux-security-module@vger.kernel.org, Leon Romanovsky


> From: Eric W. Biederman <ebiederm@xmission.com>
> Sent: Wednesday, April 30, 2025 9:05 AM
> 
> Parav Pandit <parav@nvidia.com> writes:
> 
> >> From: Eric W. Biederman <ebiederm@xmission.com>
> >> Sent: Monday, April 28, 2025 10:34 PM
> >
> > [..]
> >> > I said "user_ns of the netns"?  Credentials of the process is
> >> > something else?
> >>
> >> Exactly the credentials of the a process are not:
> >> 	current->nsproxy->net_ns->user_ns;  /* Not this */
> >>
> >> The credentials of a process are:
> >> 	current->cred;  /* This */
> >>
> >> With current->cred->user_ns the current processes user namespace.
> >>
> > I am confused with your above response.
> > In response [1], you described that net ns is the resource, hence
> > resource's user namespace is considered.
> > And your response [1] also aligns to existing code of [2] and many similar
> conversions done by your commit 276996fda0f33.
> >
> > [1]
> > https://lore.kernel.org/linux-rdma/87ikmnd3j6.fsf@email.froward.int.eb
> > iederm.org/T/#me5983d8248de0ff9670644c57d71009debaedd6f
> > [2]
> > https://elixir.bootlin.com/linux/v6.14.3/source/net/ipv4/af_inet.c#L31
> > 4
> >
> > So in infiniband, when I replace existing capable() with ns_capable(),
> > shouldn't I use current->nsproxy->net_ns->user_ns following [1] and
> > [2], because for infiniband too, the resource is net namespace.
> 
> Almost.
> 
> It is true that current->nsproxy->net_ns matches ib_device->net_ns at open
> time, but those permission checks don't happen at open time.
> 
> After open time you want ib_device->net_ns.  Not
> current->nsproxy->net_ns.
> 
> At which point your ns_capable call will look something like:
> 
> 	ns_capable(ib_device->net_ns->user_ns, CAP_NET_RAW);
> 
> That ns_capable call will then check
> 
> ib_device->net_ns->user_ns against
> current->cred->user_ns.
> 
> And it will verify that CAP_NET_RAW is in
> current->cred->cap_effect.
> 
> Thus checking the resource (the ib_device) against the current process's
> credentials.
> 
> ----
> 
> The danger of using current->nsproxy->net_ns->user ns after open time is the
> caller may have done.
> 
> unshare(CLONE_NEWUSER);
> unshare(CLONE_NEWNET);
> 
> At which point
> "ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)"
> is guaranteed to be true.
> 
> But it isn't meaningful because there are be no ib_devices in that network
> namespace.
>
True, but the resource was net namespace and not the ib device.
The capability is of the network namespace that is checked against.

But I think I can ib_device check as well.

> ----
> 
> Because of the shared device stuff a relaxed permission check would actually
> need to look more like.
> 
> 	struct user_ns *user_ns = shared ? &init_user_ns : ib_device->net_ns-
> >user_ns;
>         ns_capable(user_ns, CAP_NET_RAW);
> 
> This allows sharing the capable call for better maintenance but only relaxing
> the permission check for the other cases.
>
Yes, this was the plan.

Thanks a lot for the guidance. If no further comments, I will send out v1 adopting above suggestions.
 
> Eric
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2025-04-30 12:14 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-13  5:08 [PATCH] RDMA/uverbs: Consider capability of the process that opens the file Parav Pandit
2025-03-17 19:31 ` Jason Gunthorpe
2025-03-18  3:43   ` Parav Pandit
2025-03-18 11:20     ` Jason Gunthorpe
2025-03-18 12:30       ` Parav Pandit
2025-03-18 12:44         ` Jason Gunthorpe
2025-03-18 20:00       ` Eric W. Biederman
2025-03-18 22:57         ` Jason Gunthorpe
2025-04-04 14:53           ` Parav Pandit
2025-04-04 15:13             ` Jason Gunthorpe
2025-04-06 14:15               ` Serge E. Hallyn
2025-04-07 11:16                 ` Parav Pandit
2025-04-07 14:46                   ` sergeh
2025-04-20 12:30                     ` Parav Pandit
2025-04-20 13:41                       ` Serge E. Hallyn
2025-04-20 17:31                         ` Parav Pandit
2025-04-07 16:12                   ` Jason Gunthorpe
2025-04-08 14:44                     ` Eric W. Biederman
2025-04-21  3:13             ` Serge E. Hallyn
2025-04-21 11:04               ` Parav Pandit
2025-04-21 13:00                 ` Serge E. Hallyn
2025-04-21 13:33                   ` Parav Pandit
2025-04-21 17:22                     ` Serge E. Hallyn
2025-04-22 12:46                       ` Jason Gunthorpe
2025-04-22 13:14                         ` Serge E. Hallyn
2025-04-22 16:11                           ` Jason Gunthorpe
2025-04-22 16:29                             ` Serge E. Hallyn
2025-04-23 12:41                               ` Parav Pandit
2025-04-23 14:46                                 ` Jason Gunthorpe
2025-04-23 15:43                                   ` Eric W. Biederman
2025-04-23 15:56                                     ` Parav Pandit
2025-04-23 16:45                                       ` Jason Gunthorpe
2025-04-24  9:08                                         ` Parav Pandit
2025-04-24 14:13                                           ` Jason Gunthorpe
2025-04-25 13:14                                             ` Parav Pandit
2025-04-25 13:29                                               ` Jason Gunthorpe
2025-04-25 13:54                                                 ` Parav Pandit
2025-04-25 14:06                                                   ` Serge E. Hallyn
2025-04-25 15:05                                                     ` Parav Pandit
2025-04-25 15:29                                                       ` Serge E. Hallyn
2025-04-25 13:59                                                 ` Serge E. Hallyn
2025-04-25 14:01                                                 ` Serge E. Hallyn
2025-04-25 14:24                                                   ` Jason Gunthorpe
2025-04-25 15:06                                                     ` Serge E. Hallyn
2025-04-25 15:27                                                       ` Parav Pandit
2025-04-25 15:46                                                         ` Eric W. Biederman
2025-04-25 16:16                                                           ` Parav Pandit
2025-04-25 15:32                                                     ` Eric W. Biederman
2025-04-25 16:21                                                       ` Jason Gunthorpe
2025-04-25 17:34                                                         ` Eric W. Biederman
2025-04-25 18:20                                                           ` Parav Pandit
2025-04-25 18:35                                                           ` Jason Gunthorpe
2025-04-27 14:30                                                             ` Serge E. Hallyn
2025-04-28 17:03                                                             ` Eric W. Biederman
2025-04-29  3:56                                                               ` Eric W. Biederman
2025-04-29 10:39                                                               ` Parav Pandit
2025-04-30  3:34                                                                 ` Eric W. Biederman
2025-04-30 12:14                                                                   ` Parav Pandit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).