* nvme-fabrics: devices are uninterruptable
@ 2023-01-11 14:37 ` Belanger, Martin
2023-01-12 5:36 ` Kanchan Joshi
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Belanger, Martin @ 2023-01-11 14:37 UTC (permalink / raw)
To: linux-nvme@lists.infradead.org
Cc: Hannes Reinecke, Daniel Wagner, smith, erik, Ghalam, Joe,
Hayes, Stuart, White, Joseph L, Glimcher, Boris
POSIX.1 specifies that certain functions such as read() or write() can act as cancellation points.
Ref: https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
Cancellation point functions can be forced to terminate before completion. Typically, sending a signal to a process/thread will cause cancellation point functions to return immediately with an error (e.g. -1) and with errno set to EINTR. For example, if a read() is currently blocked on a socket and the process/thread receives a signal, then read() will return -1 and errno will be set to EINTR. At this point the process/thread has the option of ignoring errno==EINTR and resuming the read() operation, or it can decide to exit() if the signal received matches a specific type such as SIGINT (CTRL-C) or SIGTERM. To do that, the process/thread can use a signal handler that caches the type of signal received, so that when control returns to the process/thread it can query which signal type was received and act accordingly when errno==EINTR.
The nvme driver does not seem to allow cancellation points. In other words, processes/threads blocked on read()/write() associated with an NVMe device (e.g. /dev/nvme-fabrics, /sys/class/nvme/nvme0/delete_controller) cannot be interrupted by signals. This can be problematic, especially in the following cases:
1) When scaling to a large number of connections (N), applications may be blocked on /dev/nvme-fabrics for long periods of time. Creating a connection to a controller is typically very fast (msec). However, if connectivity is down (e.g. networking issues) it takes about 3 seconds for the kernel to return with an error message indicating that the connection has failed. Let's say we want to create N=100 connections while connectivity is down. Because /dev/nvme-fabrics only allows one connection request at a time, it will take 3 * N = 300 seconds (5 minutes) before all connection requests get processed by the kernel. If multiple processes/threads request connections in parallel, they will all be blocked (except for 1) trying to write to /dev/nvme-fabrics. And there is no way to stop/cancel a process/thread once it is blocked on /dev/nvme-fabrics. Signals, including SIGKILL, have no effect whatsoever.
2) Similarly, deleting a controller by writing "1" to the "delete_controller" device while connectivity to that controller is down will block the calling process/thread for 1 minute (built-in timeout waiting for a response). While blocked, there is no way to terminate the process/thread. SIGINT (CTRL-C), SIGTERM, or even SIGKILL have no effect.
I wanted to ask the community whether there is a reason for the nvme driver not to support POSIX cancellation points, and whether it would be possible to add support for them. Is there a downside to doing so?
Regards,
Martin
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nvme-fabrics: devices are uninterruptable
2023-01-11 14:37 ` nvme-fabrics: devices are uninterruptable Belanger, Martin
@ 2023-01-12 5:36 ` Kanchan Joshi
2023-01-12 10:38 ` Hannes Reinecke
2023-01-13 11:26 ` Martin Wilck
2023-01-17 10:51 ` Christoph Hellwig
2 siblings, 1 reply; 7+ messages in thread
From: Kanchan Joshi @ 2023-01-12 5:36 UTC (permalink / raw)
To: Belanger, Martin
Cc: linux-nvme@lists.infradead.org, Hannes Reinecke, Daniel Wagner,
smith, erik, Ghalam, Joe, Hayes, Stuart, White, Joseph L,
Glimcher, Boris, axboe
On Wed, Jan 11, 2023 at 02:37:58PM +0000, Belanger, Martin wrote:
>POSIX.1 specifies that certain functions such as read() or write() can act as cancellation points.
>
>Ref: https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
Not exactly related to the original question, but I hear the
cancellation requirement for passthrough too.
And it seems feasible, as io_uring provides a cancellation interface
(IORING_OP_ASYNC_CANCEL) to user-space.
It is only at the io_uring level, and does not percolate down to lower
layers (as we don't have an interface).
Would it make sense to grow such an interface for uring_cmd? Either
a new file-op ->uring_cmd_cancel, or the existing ->uring_cmd with a
new cancel flag.
Not sure if ublk needs it too, but NVMe can support this (new op/flag)
cancellation by issuing an abort command to the device.
Down in nvme, we would need the command-id/queue-id to issue the abort
command, and that may be tricky to store (although it is something we
do for iopoll). Maybe I can figure something out while cooking up an RFC.
But first things first,
Christoph, Jens: does this sound reasonable?
* Re: nvme-fabrics: devices are uninterruptable
2023-01-12 5:36 ` Kanchan Joshi
@ 2023-01-12 10:38 ` Hannes Reinecke
0 siblings, 0 replies; 7+ messages in thread
From: Hannes Reinecke @ 2023-01-12 10:38 UTC (permalink / raw)
To: Kanchan Joshi, Belanger, Martin
Cc: linux-nvme@lists.infradead.org, Daniel Wagner, smith, erik,
Ghalam, Joe, Hayes, Stuart, White, Joseph L, Glimcher, Boris,
axboe
On 1/12/23 06:36, Kanchan Joshi wrote:
> On Wed, Jan 11, 2023 at 02:37:58PM +0000, Belanger, Martin wrote:
>> POSIX.1 specifies that certain functions such as read() or write() can
>> act as cancellation points.
>>
>> Ref:
>> https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
>
> Not exactly related to the original question, but I hear the
> cancellation requirement for passthrough too.
>
> And it seems feasible as io_uring provides cancellation interface
> (IORING_OP_ASYNC_CANCEL) to user-space.
> It is only at io_uring level, and does not percolate down to lower
> layers (as we don't have an interface).
> Would it make sense to grow such an interface for uring_cmd? Either
> a new file-op ->uring_cmd_cancel, or the existing ->uring_cmd with a
> new cancel flag. Not sure if ublk needs it too, but NVMe can support
> this (new op/flag) cancellation by issuing an abort command to the
> device.
>
> Down in nvme, we would need the command-id/queue-id to issue the abort
> command, and that may be tricky to store (although it is something we
> do for iopoll). Maybe I can figure something out while cooking up an RFC.
>
> But first things first, Christoph, Jens: does this sound reasonable?
>
Command aborts for io_uring! Yay!
Something I have been needing since time immemorial; KVM suffers
from this as we can't do a command abort from the guest, meaning we
always have to wait for command completion _from the host_, making it
impossible to trigger a failover or even start error recovery.
So, yes, please. We should at least look at it, and io_uring looks
ideally suited.
Cheers,
Hannes
* Re: nvme-fabrics: devices are uninterruptable
2023-01-11 14:37 ` nvme-fabrics: devices are uninterruptable Belanger, Martin
2023-01-12 5:36 ` Kanchan Joshi
@ 2023-01-13 11:26 ` Martin Wilck
2023-01-13 16:58 ` Belanger, Martin
2023-01-17 10:51 ` Christoph Hellwig
2 siblings, 1 reply; 7+ messages in thread
From: Martin Wilck @ 2023-01-13 11:26 UTC (permalink / raw)
To: linux-nvme; +Cc: Belanger, Martin
On Wed, 2023-01-11 at 14:37 +0000, Belanger, Martin wrote:
> POSIX.1 specifies that certain functions such as read() or write()
> can act as cancellation points.
>
> Ref:
> https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
>
> Cancellation point functions can be forced to terminate before
> completion.
I think you are confusing things here. The page you mention is about
pthreads. pthread cancellation points are points at which a
pthread_cancel() call from another thread will interrupt a thread that
is using PTHREAD_CANCEL_DEFERRED cancellability, and nothing more. The
"cancellation point" logic applies *only* to the specific signal that
is used for implementing pthread_cancel(). It has nothing to do with
the cancellation of I/O requests. The spec says nothing about the
semantics of cancelling I/O system calls. Usually the thread
cancellation will occur either before entering or after returning from
the system call, rather than interrupting it. The general semantics of
signal delivery apply.
> Typically, sending a signal to a process/thread will cause
> cancellation point functions to return immediately with an error
> (e.g. -1) and with errno set to EINTR. [...]
>
> The nvme driver does not seem to allow cancellation points. In other
> words, processes/threads blocked on read()/write() associated with a
> nvme device (e.g. /dev/nvme-fabrics,
> /sys/class/nvme/nvme0/delete_controller) cannot be interrupted by
> signals. This can be problematic especially for the following cases:
What you actually want to refer to is (I think) the section about
"Interruption of system calls and library functions by signal handlers"
in signal(7): "If a blocked call to one of the following interfaces
is interrupted by a signal handler, then [...] the call fails with the
error EINTR: ... read(2), readv(2), write(2), writev(2), and ioctl(2)
calls on 'slow' devices." Note that this paragraph goes on to say that
"a (local) disk is not a slow device according to this definition; I/O
operations on disk devices are not interrupted by signals." I assume
the last sentence applies to NVMe disks, too. nvme-fabrics is a
different topic; one could argue it should have socket-like semantics
(and socket IO _is_ interrupted with EINTR, same man page section).
> 1) When scaling to a large number of connections (N), applications
> may be blocked on /dev/nvme-fabrics for long periods of time.
> Creating a connection to a controller is typically very fast (msec).
> However, if connectivity is down (e.g. networking issues) it takes
> about 3 seconds for the kernel to return with an error message
> indicating that the connection has failed. Let's say we want to
> create N=100 connections while connectivity is down. Because
> /dev/nvme-fabrics only allows one connection request at a time, it
> will take 3 * N = 300 seconds (5 minutes) before all connection
> requests get processed by the kernel. If multiple processes/threads
> request connections in parallel, they will all be blocked (except for
> 1) trying to write to /dev/nvme-fabrics. And there is no way to
> stop/cancel a process/thread once it is blocked on /dev/nvme-fabrics.
> Signals, including SIGKILL, have no effect whatsoever.
I think that SIGKILL does have an effect; it will at least turn the
affected process into a zombie. See above for nvme-fabrics.
> 2) Similarly, deleting a controller by writing "1" to the
> "delete_controller" device while connectivity to that controller is
> down will block the calling process/thread for 1 minute (built-in
> timeout waiting for a response). While blocked, there is no way to
> terminate the process/thread. SIGINT (CTRL-C), SIGTERM, or even
> SIGKILL have no effect.
>
> I wanted to ask the community if there is a reason for the nvme
> driver to not support POSIX cancellation points? I also wanted to
> know whether it would be possible to add support for it? Is there a
> downside to doing so?
Repeat, this has nothing to do with cancellation points.
Martin
* RE: nvme-fabrics: devices are uninterruptable
2023-01-13 11:26 ` Martin Wilck
@ 2023-01-13 16:58 ` Belanger, Martin
0 siblings, 0 replies; 7+ messages in thread
From: Belanger, Martin @ 2023-01-13 16:58 UTC (permalink / raw)
To: Martin Wilck, linux-nvme@lists.infradead.org
> On Wed, 2023-01-11 at 14:37 +0000, Belanger, Martin wrote:
> > POSIX.1 specifies that certain functions such as read() or write() can
> > act as cancellation points.
> >
> > Ref:
> > https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
> >
> > Cancellation point functions can be forced to terminate before
> > completion.
>
> I think you are confusing things here. The page you mention is about pthreads.
> pthread cancellation points are points at which a
> pthread_cancel() call from another will interrupt a thread that is using
> PTHREAD_CANCEL_DEFERRED cancellability, and nothing more. The
> "cancellation point" logic applies *only* to the specific signal that is used for
> implementing pthread_cancel(). It has nothing to do with the cancellation of
> I/O requests. The spec says nothing about the semantics of cancelling I/O
> system calls. Usually the thread cancellation will occur either before entering or
> after returning from the system call, rather than interrupting it. The general
> semantics of signal delivery apply.
Hi Martin. Thanks for your response.
Agreed. I should not have used the "cancellation point" terminology. Instead, I could simply have said that many system calls will report the EINTR error code if a signal occurred while the system call was in progress. I've used this in dozens of projects in the past and it has always worked flawlessly (e.g. writing to sockets).
The documentation says that a blocked write() may return a number of bytes less than the specified count if, among other things, the call was interrupted by a signal after it had transferred some, but before it had transferred all the requested bytes.
Similarly, read() can return with a number of bytes smaller than the requested number if read() was interrupted by a signal.
I understand that /dev/nvme-fabrics is a special kind of file. It's not a regular file and it's not a socket. However, it should be possible to interrupt a process that is currently pending to write() to /dev/nvme-fabrics before any bytes have been written. Once bytes have actually been written and the kernel has started processing the connection request, I recognize that interrupting the write() operation at that point is not desirable. However, if we have several processes/threads currently pending to write() to /dev/nvme-fabrics because another process is currently being served, and before they have a chance to write any bytes, it is perfectly reasonable to allow a signal to interrupt the write() and allow these processes to exit, no harm done.
Another thing that I've been wondering is why the kernel does not allow multiple connection requests in parallel. It should be possible for multiple processes to write commands to /dev/nvme-fabrics concurrently. I mean, each process needs to open() /dev/nvme-fabrics, which gives them their own file descriptor. Then they can write() to or read() from that file descriptor independently of what other processes are doing. This would prevent processes from being blocked for long periods of time, as in the 100-connection example I mentioned earlier. In other words, 100 connection requests could be made in parallel, all of them timing out at the same time after 3 seconds (instead of 5 minutes).
>
> > Typically, sending a signal to a process/thread will cause
> > cancellation point functions to return immediately with an error (e.g.
> > -1) and with errno set to EINTR. [...]
> >
> > The nvme driver does not seem to allow cancellation points. In other
> > words, processes/threads blocked on read()/write() associated with a
> > nvme device (e.g. /dev/nvme-fabrics,
> > /sys/class/nvme/nvme0/delete_controller) cannot be interrupted by
> > signals. This can be problematic especially for the following cases:
>
> What you actually want to refer to is (I think) the section about "Interruption of
> system calls and library functions by signal handlers"
> in signal(7): "If a blocked call to one of the following interfaces is interrupted
> by a signal handler, then [...] the call fails with the error EINTR: ... read(2),
> readv(2), write(2), writev(2), and ioctl(2) calls on 'slow' devices." Note that this
> paragraph goes on saying that "a (local) disk is not a slow device according to
> this definition; I/O operations on disk devices are not interrupted by signals." I
> assume the last sentence applies to NVMe disks, too. nvme-fabrics is a
> different topic; one could argue it should have socket-like semantics (and
> socket IO _is_ interrupted with EINTR, same man page section).
Exactly!
>
> > 1) When scaling to a large number of connections (N), applications may
> > be blocked on /dev/nvme-fabrics for long periods of time.
> > Creating a connection to a controller is typically very fast (msec).
> > However, if connectivity is down (e.g. networking issues) it takes
> > about 3 seconds for the kernel to return with an error message
> > indicating that the connection has failed. Let's say we want to create
> > N=100 connections while connectivity is down. Because
> > /dev/nvme-fabrics only allows one connection request at a time, it
> > will take 3 * N = 300 seconds (5 minutes) before all connection
> > requests get processed by the kernel. If multiple processes/threads
> > request connections in parallel, they will all be blocked (except for
> > 1) trying to write to /dev/nvme-fabrics. And there is no way to
> > stop/cancel a process/thread once it is blocked on /dev/nvme-fabrics.
> > Signals, including SIGKILL, have no effect whatsoever.
>
> I think that SIGKILL does have an effect; it will at least turn the affected
> process into a zombie. See above for nvme-fabrics.
I'll have to check that again. Last time I tried I did not see any effect with "kill -9".
>
>
> > 2) Similarly, deleting a controller by writing "1" to the
> > "delete_controller" device while connectivity to that controller is
> > down will block the calling process/thread for 1 minute (built-in
> > timeout waiting for a response). While blocked, there is no way to
> > terminate the process/thread. SIGINT (CTRL-C), SIGTERM, or even
> > SIGKILL have no effect.
> >
> > I wanted to ask the community if there is a reason for the nvme driver
> > to not support POSIX cancellation points? I also wanted to know
> > whether it would be possible to add support for it? Is there a
> > downside to doing so?
>
> Repeat, this has nothing to do with cancellation points.
>
> Martin
* Re: nvme-fabrics: devices are uninterruptable
2023-01-11 14:37 ` nvme-fabrics: devices are uninterruptable Belanger, Martin
2023-01-12 5:36 ` Kanchan Joshi
2023-01-13 11:26 ` Martin Wilck
@ 2023-01-17 10:51 ` Christoph Hellwig
2023-01-17 15:28 ` Belanger, Martin
2 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2023-01-17 10:51 UTC (permalink / raw)
To: Belanger, Martin
Cc: linux-nvme@lists.infradead.org, Hannes Reinecke, Daniel Wagner,
smith, erik, Ghalam, Joe, Hayes, Stuart, White, Joseph L,
Glimcher, Boris
On Wed, Jan 11, 2023 at 02:37:58PM +0000, Belanger, Martin wrote:
> POSIX.1 specifies that certain functions such as read() or write() can act as cancellation points.
device special files are mostly out of scope for the normal POSIX
rules.
> I wanted to ask the community if there is a reason for the nvme driver to not support POSIX cancellation points? I also wanted to know whether it would be possible to add support for it? Is there a downside to doing so?
How do you propose to allow for safe interruption?
* RE: nvme-fabrics: devices are uninterruptable
2023-01-17 10:51 ` Christoph Hellwig
@ 2023-01-17 15:28 ` Belanger, Martin
0 siblings, 0 replies; 7+ messages in thread
From: Belanger, Martin @ 2023-01-17 15:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme@lists.infradead.org, Hannes Reinecke, Daniel Wagner,
smith, erik, Ghalam, Joe, Hayes, Stuart, White, Joseph L,
Glimcher, Boris
> On Wed, Jan 11, 2023 at 02:37:58PM +0000, Belanger, Martin wrote:
> > POSIX.1 specifies that certain functions such as read() or write() can act as
> cancellation points.
>
> device special files are mostly out of scope for the normal Posix rules..
>
> > I wanted to ask the community if there is a reason for the nvme driver to not
> support POSIX cancellation points? I also wanted to know whether it would be
> possible to add support for it? Is there a downside to doing so?
>
> How do you propose to allow for safe interruption?
Hi Christoph,
Connection requests that are pending because the kernel is currently busy working on another connection request should be cancellable. I agree that once the kernel starts processing a connection request, that connection request can no longer be cancelled. It would be too complex to cleanly interrupt a connection request mid-flight.
On the other hand, if the kernel allowed all the connection requests to be processed concurrently such that no connection request gets delayed by another one, then there would be no need for cancellation support.
This is only a problem when large numbers of connection requests are made at the same time while there are connectivity issues. That's because a failing connection blocks the /dev/nvme-fabrics interface for about 3 seconds. For large numbers of failing connections, the interface can block for long periods of time (it only takes 20 failing connections to make the interface busy for a whole minute). Allowing multiple connection requests in parallel would reduce the amount of blocking, since the failing connection requests would not get in the way of the successful ones. It also means that all the failing connection requests would be reported as failed more or less at the same time, after 3 seconds, instead of being reported one at a time, once every 3 seconds.
Regards,
Martin