linux-security-module.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
       [not found] <20231204175342.3418422-1-kbusch@meta.com>
@ 2023-12-04 18:40 ` Jeff Moyer
  2023-12-04 18:57   ` Keith Busch
  2023-12-04 19:01   ` Jens Axboe
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff Moyer @ 2023-12-04 18:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, io-uring, axboe, hch, sagi, asml.silence, Keith Busch,
	linux-security-module

I added a CC: linux-security-module@vger

Hi, Keith,

Keith Busch <kbusch@meta.com> writes:

> From: Keith Busch <kbusch@kernel.org>
>
> The uring_cmd operation is often used for privileged actions, so drivers
> subscribing to this interface check capable() for each command. The
> capable() function is not fast path friendly for many kernel configs,
> and this can really harm performance. Stash the capable sys admin
> attribute in the io_uring context and set a new issue_flag for the
> uring_cmd interface.

I have a few questions.  What privileged actions are performance
sensitive?  I would hope that anything requiring privileges would not be
in a fast path (but clearly that's not the case).  What performance
benefits did you measure with this patch set in place (and on what
workloads)?  What happens when a ring fd is passed to another process?

Finally, as Jens mentioned, I would expect dropping priviliges to, you
know, drop privileges.  I don't think a commit message is going to be
enough documentation for a change like this.

Cheers,
Jeff

>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  include/linux/io_uring_types.h | 4 ++++
>  io_uring/io_uring.c            | 1 +
>  io_uring/uring_cmd.c           | 2 ++
>  3 files changed, 7 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index bebab36abce89..d64d6916753f0 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -36,6 +36,9 @@ enum io_uring_cmd_flags {
>  	/* set when uring wants to cancel a previously issued command */
>  	IO_URING_F_CANCEL		= (1 << 11),
>  	IO_URING_F_COMPAT		= (1 << 12),
> +
> +	/* ring validated as CAP_SYS_ADMIN capable */
> +	IO_URING_F_SYS_ADMIN		= (1 << 13),
>  };
>  
>  struct io_wq_work_node {
> @@ -240,6 +243,7 @@ struct io_ring_ctx {
>  		unsigned int		poll_activated: 1;
>  		unsigned int		drain_disabled: 1;
>  		unsigned int		compat: 1;
> +		unsigned int		sys_admin: 1;
>  
>  		struct task_struct	*submitter_task;
>  		struct io_rings		*rings;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>  		ctx->syscall_iopoll = 1;
>  
>  	ctx->compat = in_compat_syscall();
> +	ctx->sys_admin = capable(CAP_SYS_ADMIN);
>  	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>  		ctx->user = get_uid(current_user());
>  
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8a38b9f75d841..764f0e004aa00 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>  		issue_flags |= IO_URING_F_CQE32;
>  	if (ctx->compat)
>  		issue_flags |= IO_URING_F_COMPAT;
> +	if (ctx->sys_admin)
> +		issue_flags |= IO_URING_F_SYS_ADMIN;
>  	if (ctx->flags & IORING_SETUP_IOPOLL) {
>  		if (!file->f_op->uring_cmd_iopoll)
>  			return -EOPNOTSUPP;


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-04 18:40 ` [PATCH 1/2] iouring: one capable call per iouring instance Jeff Moyer
@ 2023-12-04 18:57   ` Keith Busch
  2023-12-05  4:14     ` Ming Lei
  2023-12-04 19:01   ` Jens Axboe
  1 sibling, 1 reply; 14+ messages in thread
From: Keith Busch @ 2023-12-04 18:57 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Keith Busch, linux-nvme, io-uring, axboe, hch, sagi, asml.silence,
	linux-security-module

On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> I added a CC: linux-security-module@vger
> Keith Busch <kbusch@meta.com> writes:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > The uring_cmd operation is often used for privileged actions, so drivers
> > subscribing to this interface check capable() for each command. The
> > capable() function is not fast path friendly for many kernel configs,
> > and this can really harm performance. Stash the capable sys admin
> > attribute in the io_uring context and set a new issue_flag for the
> > uring_cmd interface.
> 
> I have a few questions.  What privileged actions are performance
> sensitive? I would hope that anything requiring privileges would not
> be in a fast path (but clearly that's not the case).

Protocol specifics that don't have a generic equivalent. For example,
NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
but you use it like normal reads and writes so has to be as fast as the
generic interfaces.

The same interfaces can be abused, so access needs to be restricted.

> What performance benefits did you measure with this patch set in place
> (and on what workloads)? 

Quite a bit. Here's a random read high-depth workload on a single
device test:

Before: 970k IOPs
After: 1750k IOPs

> What happens when a ring fd is passed to another process?
> 
> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> know, drop privileges.  I don't think a commit message is going to be
> enough documentation for a change like this.

Yeah, point taken.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-04 18:40 ` [PATCH 1/2] iouring: one capable call per iouring instance Jeff Moyer
  2023-12-04 18:57   ` Keith Busch
@ 2023-12-04 19:01   ` Jens Axboe
  2023-12-04 19:22     ` Jeff Moyer
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2023-12-04 19:01 UTC (permalink / raw)
  To: Jeff Moyer, Keith Busch
  Cc: linux-nvme, io-uring, hch, sagi, asml.silence, Keith Busch,
	linux-security-module

On 12/4/23 11:40 AM, Jeff Moyer wrote:
> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> know, drop privileges.  I don't think a commit message is going to be
> enough documentation for a change like this.

Only thing I can think of here is to cache the state in
task->io_uring->something, and then ensure those are invalidated
whenever caps change. It's one of those cases where that's probably only
done once, but we do need to be able to catch it. Not convinced that
caching it at ring creation is sane enough, even if it is kind of like
opening devices before privs are dropped where you could not otherwise
re-open them later on.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-04 19:01   ` Jens Axboe
@ 2023-12-04 19:22     ` Jeff Moyer
  2023-12-04 19:33       ` Jens Axboe
  2023-12-04 19:37       ` Keith Busch
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff Moyer @ 2023-12-04 19:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, linux-nvme, io-uring, hch, sagi, asml.silence,
	Keith Busch, linux-security-module

Jens Axboe <axboe@kernel.dk> writes:

> On 12/4/23 11:40 AM, Jeff Moyer wrote:
>> Finally, as Jens mentioned, I would expect dropping priviliges to, you
>> know, drop privileges.  I don't think a commit message is going to be
>> enough documentation for a change like this.
>
> Only thing I can think of here is to cache the state in
> task->io_uring->something, and then ensure those are invalidated
> whenever caps change.

I looked through the capable() code, and there is no way that I could
find to be notified of changes.

> It's one of those cases where that's probably only done once, but we
> do need to be able to catch it. Not convinced that caching it at ring
> creation is sane enough, even if it is kind of like opening devices
> before privs are dropped where you could not otherwise re-open them
> later on.

Agreed.

-Jeff


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-04 19:22     ` Jeff Moyer
@ 2023-12-04 19:33       ` Jens Axboe
  2023-12-04 19:37       ` Keith Busch
  1 sibling, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2023-12-04 19:33 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Keith Busch, linux-nvme, io-uring, hch, sagi, asml.silence,
	Keith Busch, linux-security-module

On 12/4/23 12:22 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 12/4/23 11:40 AM, Jeff Moyer wrote:
>>> Finally, as Jens mentioned, I would expect dropping priviliges to, you
>>> know, drop privileges.  I don't think a commit message is going to be
>>> enough documentation for a change like this.
>>
>> Only thing I can think of here is to cache the state in
>> task->io_uring->something, and then ensure those are invalidated
>> whenever caps change.
> 
> I looked through the capable() code, and there is no way that I could
> find to be notified of changes.

Right, what I meant is that you'd need to add an io_uring_cap_change()
or something that gets called, and that iterates the rings associated
with that task and clears the flag. Ugly...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-04 19:22     ` Jeff Moyer
  2023-12-04 19:33       ` Jens Axboe
@ 2023-12-04 19:37       ` Keith Busch
  1 sibling, 0 replies; 14+ messages in thread
From: Keith Busch @ 2023-12-04 19:37 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jens Axboe, Keith Busch, linux-nvme, io-uring, hch, sagi,
	asml.silence, linux-security-module

On Mon, Dec 04, 2023 at 02:22:22PM -0500, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
> > On 12/4/23 11:40 AM, Jeff Moyer wrote:
> >> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> >> know, drop privileges.  I don't think a commit message is going to be
> >> enough documentation for a change like this.
> >
> > Only thing I can think of here is to cache the state in
> > task->io_uring->something, and then ensure those are invalidated
> > whenever caps change.
> 
> I looked through the capable() code, and there is no way that I could
> find to be notified of changes.

Something like LSM_HOOK_INIT on 'capset', but needs to work without
CONFIG_SECURITY.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-04 18:57   ` Keith Busch
@ 2023-12-05  4:14     ` Ming Lei
  2023-12-05  4:31       ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-12-05  4:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > I added a CC: linux-security-module@vger
> > Keith Busch <kbusch@meta.com> writes:
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > The uring_cmd operation is often used for privileged actions, so drivers
> > > subscribing to this interface check capable() for each command. The
> > > capable() function is not fast path friendly for many kernel configs,
> > > and this can really harm performance. Stash the capable sys admin
> > > attribute in the io_uring context and set a new issue_flag for the
> > > uring_cmd interface.
> > 
> > I have a few questions.  What privileged actions are performance
> > sensitive? I would hope that anything requiring privileges would not
> > be in a fast path (but clearly that's not the case).
> 
> Protocol specifics that don't have a generic equivalent. For example,
> NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> but you use it like normal reads and writes so has to be as fast as the
> generic interfaces.

But normal read/write pt command doesn't require ADMIN any more since 
commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-05  4:14     ` Ming Lei
@ 2023-12-05  4:31       ` Keith Busch
  2023-12-05  5:25         ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2023-12-05  4:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > > I added a CC: linux-security-module@vger
> > > Keith Busch <kbusch@meta.com> writes:
> > > > From: Keith Busch <kbusch@kernel.org>
> > > >
> > > > The uring_cmd operation is often used for privileged actions, so drivers
> > > > subscribing to this interface check capable() for each command. The
> > > > capable() function is not fast path friendly for many kernel configs,
> > > > and this can really harm performance. Stash the capable sys admin
> > > > attribute in the io_uring context and set a new issue_flag for the
> > > > uring_cmd interface.
> > > 
> > > I have a few questions.  What privileged actions are performance
> > > sensitive? I would hope that anything requiring privileges would not
> > > be in a fast path (but clearly that's not the case).
> > 
> > Protocol specifics that don't have a generic equivalent. For example,
> > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> > but you use it like normal reads and writes so has to be as fast as the
> > generic interfaces.
> 
> But normal read/write pt command doesn't require ADMIN any more since 
> commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
> why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?

Good question. The "capable" check had always been first so even with
the relaxed permissions, it was still paying the price. I have changed
that order in commit staged here (not yet upstream):

  http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622

Note that only prevents the costly capable() check if the inexpensive
checks could make a determination. That's still not solving the problem
long term since we aim for forward compatibility where we have no idea
which opcodes, admin identifications, or vendor specifics could be
deemed "safe" for non-root users in the future, so those conditions
would always fall back to the more expensive check that this patch was
trying to mitigate for admin processes.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-05  4:31       ` Keith Busch
@ 2023-12-05  5:25         ` Ming Lei
  2023-12-05 15:45           ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-12-05  5:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> > > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > > > I added a CC: linux-security-module@vger
> > > > Keith Busch <kbusch@meta.com> writes:
> > > > > From: Keith Busch <kbusch@kernel.org>
> > > > >
> > > > > The uring_cmd operation is often used for privileged actions, so drivers
> > > > > subscribing to this interface check capable() for each command. The
> > > > > capable() function is not fast path friendly for many kernel configs,
> > > > > and this can really harm performance. Stash the capable sys admin
> > > > > attribute in the io_uring context and set a new issue_flag for the
> > > > > uring_cmd interface.
> > > > 
> > > > I have a few questions.  What privileged actions are performance
> > > > sensitive? I would hope that anything requiring privileges would not
> > > > be in a fast path (but clearly that's not the case).
> > > 
> > > Protocol specifics that don't have a generic equivalent. For example,
> > > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> > > but you use it like normal reads and writes so has to be as fast as the
> > > generic interfaces.
> > 
> > But normal read/write pt command doesn't require ADMIN any more since 
> > commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
> > why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?
> 
> Good question. The "capable" check had always been first so even with
> the relaxed permissions, it was still paying the price. I have changed
> that order in commit staged here (not yet upstream):
> 
>   http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622

With this change, I guess you shouldn't see the following big gap, right?

> Before: 970k IOPs
> After: 1750k IOPs

> 
> Note that only prevents the costly capable() check if the inexpensive
> checks could make a determination. That's still not solving the problem
> long term since we aim for forward compatibility where we have no idea
> which opcodes, admin identifications, or vendor specifics could be
> deemed "safe" for non-root users in the future, so those conditions
> would always fall back to the more expensive check that this patch was
> trying to mitigate for admin processes.

Not sure I get the idea, it is related with nvme's permission model for
user pt command, and:

1) it should be always checked in entry of nvme user pt command

2) only the following two types of commands require ADMIN, per commit
855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")

    - any admin-cmd is not allowed
    - vendor-specific and fabric commmand are not allowed

Can you provide more details why the expensive check can't be avoided for
fast read/write user IO commands?

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-05  5:25         ` Ming Lei
@ 2023-12-05 15:45           ` Keith Busch
  2023-12-06  3:08             ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2023-12-05 15:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> > Good question. The "capable" check had always been first so even with
> > the relaxed permissions, it was still paying the price. I have changed
> > that order in commit staged here (not yet upstream):
> > 
> >   http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
> 
> With this change, I guess you shouldn't see the following big gap, right?

Correct.
 
> > Before: 970k IOPs
> > After: 1750k IOPs
 
> > Note that only prevents the costly capable() check if the inexpensive
> > checks could make a determination. That's still not solving the problem
> > long term since we aim for forward compatibility where we have no idea
> > which opcodes, admin identifications, or vendor specifics could be
> > deemed "safe" for non-root users in the future, so those conditions
> > would always fall back to the more expensive check that this patch was
> > trying to mitigate for admin processes.
> 
> Not sure I get the idea, it is related with nvme's permission model for
> user pt command, and:
> 
> 1) it should be always checked in entry of nvme user pt command
> 
> 2) only the following two types of commands require ADMIN, per commit
> 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
> 
>     - any admin-cmd is not allowed
>     - vendor-specific and fabric commmand are not allowed
> 
> Can you provide more details why the expensive check can't be avoided for
> fast read/write user IO commands?

It's not necessarily about the read/write passthrough commands. It's for
commands we don't know about today. Do we want to revisit this problem
every time spec provides another operation? Are vendor unique solutions
not allowed to get high IOPs access?

Secondly, some people have rediscovered you can abuse this interface to
corrupt kernel memory, so there are considerations to restricting this
to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
have to go that route.

Lastly (and least important), there are a lot of checks happening in the
"quick" path that I am trying to replace with a single check. While each
invidividual check isn't too bad, they start to add up.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-05 15:45           ` Keith Busch
@ 2023-12-06  3:08             ` Ming Lei
  2023-12-06 15:31               ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-12-06  3:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> > > Good question. The "capable" check had always been first so even with
> > > the relaxed permissions, it was still paying the price. I have changed
> > > that order in commit staged here (not yet upstream):
> > > 
> > >   http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
> > 
> > With this change, I guess you shouldn't see the following big gap, right?
> 
> Correct.
>  
> > > Before: 970k IOPs
> > > After: 1750k IOPs
>  
> > > Note that only prevents the costly capable() check if the inexpensive
> > > checks could make a determination. That's still not solving the problem
> > > long term since we aim for forward compatibility where we have no idea
> > > which opcodes, admin identifications, or vendor specifics could be
> > > deemed "safe" for non-root users in the future, so those conditions
> > > would always fall back to the more expensive check that this patch was
> > > trying to mitigate for admin processes.
> > 
> > Not sure I get the idea, it is related with nvme's permission model for
> > user pt command, and:
> > 
> > 1) it should be always checked in entry of nvme user pt command
> > 
> > 2) only the following two types of commands require ADMIN, per commit
> > 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
> > 
> >     - any admin-cmd is not allowed
> >     - vendor-specific and fabric commmand are not allowed
> > 
> > Can you provide more details why the expensive check can't be avoided for
> > fast read/write user IO commands?
> 
> It's not necessarily about the read/write passthrough commands. It's for
> commands we don't know about today. Do we want to revisit this problem
> every time spec provides another operation? Are vendor unique solutions
> not allowed to get high IOPs access?

Except for read/write, what other commands are performance sensitive?

> 
> Secondly, some people have rediscovered you can abuse this interface to
> corrupt kernel memory, so there are considerations to restricting this

Just wondering why ADMIN won't corrupt kernel memory, and only normal
user can, looks it is kernel bug instead of permission related issue.

> to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> have to go that route.

If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
optimize it in task_struct?


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-06  3:08             ` Ming Lei
@ 2023-12-06 15:31               ` Keith Busch
  2023-12-07  1:23                 ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2023-12-06 15:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote:
> On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> > 
> > It's not necessarily about the read/write passthrough commands. It's for
> > commands we don't know about today. Do we want to revisit this problem
> > every time spec provides another operation? Are vendor unique solutions
> > not allowed to get high IOPs access?
> 
> Except for read/write, what other commands are performance sensitive?

It varies by command set, but this question is irrelevant. I'm not
interested in gatekeeping the fast path.
 
> > Secondly, some people have rediscovered you can abuse this interface to
> > corrupt kernel memory, so there are considerations to restricting this
> 
> Just wondering why ADMIN won't corrupt kernel memory, and only normal
> user can, looks it is kernel bug instead of permission related issue.

Admin can corrupt memory as easily as a normal user through this
interface. We just don't want such capabilities to be available to
regular users.

And it's a user bug: user told the kernel to map buffer of size X, but
the device transfers size Y into it. Kernel can't do anything about that
(other than remove the interface, but such an action will break many
existing users) because we fundamentally do not know the true transfer
size of a random command. Many NVMe commands don't explicitly encode
transfer lengths, so disagreement between host and device on implicit
lengths risk corruption. It's a protocol "feature".

> > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> > have to go that route.
> 
> If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
> optimize it in task_struct?

That's an interesting point to look into. I was hoping to not touch such
a common struct, but I'm open to all options.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-06 15:31               ` Keith Busch
@ 2023-12-07  1:23                 ` Ming Lei
  2023-12-07 17:48                   ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-12-07  1:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
	asml.silence, linux-security-module, Kanchan Joshi

On Wed, Dec 06, 2023 at 08:31:54AM -0700, Keith Busch wrote:
> On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote:
> > On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> > > 
> > > It's not necessarily about the read/write passthrough commands. It's for
> > > commands we don't know about today. Do we want to revisit this problem
> > > every time spec provides another operation? Are vendor unique solutions
> > > not allowed to get high IOPs access?
> > 
> > Except for read/write, what other commands are performance sensitive?
> 
> It varies by command set, but this question is irrelevant. I'm not
> interested in gatekeeping the fast path.

IMO, it doesn't make sense to run such optimization for commands which aren't
performance sensitive.

>  
> > > Secondly, some people have rediscovered you can abuse this interface to
> > > corrupt kernel memory, so there are considerations to restricting this
> > 
> > Just wondering why ADMIN won't corrupt kernel memory, and only normal
> > user can, looks it is kernel bug instead of permission related issue.
> 
> Admin can corrupt memory as easily as a normal user through this
> interface. We just don't want such capabilities to be available to
> regular users.
> 
> And it's a user bug: user told the kernel to map buffer of size X, but
> the device transfers size Y into it. Kernel can't do anything about that
> (other than remove the interface, but such an action will break many
> existing users) because we fundamentally do not know the true transfer
> size of a random command. Many NVMe commands don't explicitly encode
> transfer lengths, so disagreement between host and device on implicit
> lengths risk corruption. It's a protocol "feature".

Got it, thanks for the explanation, and looks one big defect of
NVMe protocol or the device implementation.

> 
> > > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> > > have to go that route.
> > 
> > If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
> > optimize it in task_struct?
> 
> That's an interesting point to look into. I was hoping to not touch such
> a common struct, but I'm open to all options.
 
capability is per-thread, and it is updated in current process/pthread, so
the correct place to cache this info is 'task_struct'.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] iouring: one capable call per iouring instance
  2023-12-07  1:23                 ` Ming Lei
@ 2023-12-07 17:48                   ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2023-12-07 17:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe,
	hch, sagi, asml.silence, linux-security-module, Kanchan Joshi

On Thu, Dec 07, 2023 at 09:23:06AM +0800, Ming Lei wrote:
> Got it, thanks for the explanation, and looks one big defect of
> NVMe protocol or the device implementation.

It is.  And NVMe has plenty more problems like that and people keep
adding this kind of crap even today :(

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-12-07 17:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20231204175342.3418422-1-kbusch@meta.com>
2023-12-04 18:40 ` [PATCH 1/2] iouring: one capable call per iouring instance Jeff Moyer
2023-12-04 18:57   ` Keith Busch
2023-12-05  4:14     ` Ming Lei
2023-12-05  4:31       ` Keith Busch
2023-12-05  5:25         ` Ming Lei
2023-12-05 15:45           ` Keith Busch
2023-12-06  3:08             ` Ming Lei
2023-12-06 15:31               ` Keith Busch
2023-12-07  1:23                 ` Ming Lei
2023-12-07 17:48                   ` Christoph Hellwig
2023-12-04 19:01   ` Jens Axboe
2023-12-04 19:22     ` Jeff Moyer
2023-12-04 19:33       ` Jens Axboe
2023-12-04 19:37       ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).