* Reconnect on RDMA device reset
@ 2018-01-22 15:07 Oren Duer
  2018-01-23 12:42 ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Oren Duer @ 2018-01-22 15:07 UTC (permalink / raw)

Hi,

Today host and target stacks will respond to RDMA device reset (or plug out
and plug in) by cleaning all resources related to that device, and sitting
idle waiting for administrator intervention to reconnect (host stack) or
rebind the subsystem to a port (target stack).

I'm thinking that maybe the right behaviour should be to try and restore
everything as soon as the device becomes available again. I don't think a
device reset should look different to the users than ports going down and up
again.

At the host stack we already have a reconnect flow (which works great when
ports go down and back up). Instead of registering the ib_client callback
rdma_remove_one and cleaning up everything, we could respond to the
RDMA_CM_EVENT_DEVICE_REMOVAL event and go into that reconnect flow.

In the reconnect flow the stack already repeats creating the cm_id and
resolving address and route, so when the RDMA device comes back up, and
assuming it will be configured with the same address and connected to the
same network (as is the case in a device reset), connections will be restored
automatically.

At the target stack things are even worse. When the RDMA device resets or
disappears, the softlink between the port and the subsystem stays "hanging".
It does not represent an active bind, and when the device comes back with the
same address and network it will not start working (even though the softlink
is there). This is quite confusing to the user.

What I suggest here is to implement something similar to the reconnect flow
at the host, and repeat the flow that does the rdma_bind_addr. This way,
again, when the device comes back with the same address and network the bind
will succeed and the subsystem will become functional again. In this case it
makes sense to keep the softlink during all this time, as the stack really
tries to re-bind to the port.

These changes also clean up the code, as RDMA_CM applications should not be
registering as ib_clients in the first place...

Thoughts?

Oren Duer
Mellanox Technologies

^ permalink raw reply	[flat|nested] 19+ messages in thread
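The host-side behaviour proposed above can be sketched as a small state
machine: the only change is that DEVICE_REMOVAL feeds the same reconnect path
that a port-down event already takes, instead of tearing the controller down.
This is a user-space toy model, not actual nvme-rdma code; all names here are
invented for illustration.

```python
from enum import Enum, auto

class Event(Enum):
    ESTABLISHED = auto()
    DISCONNECTED = auto()     # port went down / peer closed the connection
    DEVICE_REMOVAL = auto()   # models RDMA_CM_EVENT_DEVICE_REMOVAL

class State(Enum):
    LIVE = auto()
    RECONNECTING = auto()
    DELETING = auto()

def handle_event(state, event, treat_removal_as_transient=True):
    """Return the next controller state for a given rdma_cm-style event.

    With treat_removal_as_transient=False this models today's behaviour
    (device removal tears the controller down for good); with True it
    models the proposal above: removal enters the existing reconnect
    flow, which keeps re-creating the cm_id and re-resolving address and
    route until the device comes back with the same address."""
    if event is Event.ESTABLISHED:
        return State.LIVE
    if event is Event.DISCONNECTED:
        return State.RECONNECTING
    if event is Event.DEVICE_REMOVAL:
        return State.RECONNECTING if treat_removal_as_transient else State.DELETING
    return state
```

Under this model a device reset looks to the user exactly like a port bounce:
the controller sits in RECONNECTING until the device reappears.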
* Reconnect on RDMA device reset
  2018-01-22 15:07 Reconnect on RDMA device reset Oren Duer
@ 2018-01-23 12:42 ` Sagi Grimberg
  2018-01-24  7:41   ` Oren Duer
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2018-01-23 12:42 UTC (permalink / raw)

> Hi,

Hey Oren,

> Today host and target stacks will respond to RDMA device reset (or plug out
> and plug in) by cleaning all resources related to that device, and sitting
> idle waiting for administrator intervention to reconnect (host stack) or
> rebind subsystem to a port (target stack).
>
> I'm thinking that maybe the right behaviour should be to try and restore
> everything as soon as the device becomes available again. I don't think a
> device reset should look different to the users than ports going down and up
> again.

Hmm, not sure I fully agree here. In my mind device removal means the
device is going away, which means there is no point in keeping the
controller around...

AFAIK device resets are usually expected to quiesce inflight I/O,
clean up resources and restore when the reset sequence completes (which
is what we do in nvme controller resets). I'm not sure I understand why
RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
rdma_cm or .remove_one via the ib_client API). I think the correct
interface would be suspend/resume semantics for RDMA device resets
(similar to the pm interface).

I think that would make for much cleaner semantics, and ULPs would be
able to understand exactly what to do (which is what you suggested
above).

CCing linux-rdma.

> At the host stack we already have a reconnect flow (which works great when
> ports go down and back up). Instead of registering to ib_client callback
> rdma_remove_one and clean up everything, we could respond to the
> RDMA_CM_EVENT_DEVICE_REMOVAL event and go into that reconnect flow.

Regardless of ib_client vs. rdma_cm, we can't simply perform normal
reconnects because we have dma mappings we need to unmap for each
request in the tagset, which we don't tear down on every reconnect (as
we may have inflight I/O). We could theoretically have used
reinit_tagset to do that, though.

Personally I think ib_client is much better than the rdma_cm
DEVICE_REMOVAL event interface because:
(1) rdma_cm is per cm_id, which means we effectively only react to the
    first event and the rest are nops, which is a bit awkward.
(2) it requires special handling for resource cleanup with respect to
    the cm_id removal: the cm_id must be destroyed within the
    DEVICE_REMOVAL event by returning a non-zero value from the event
    handler (as rdma_destroy_id() would block from the event_handler
    context), and must not be destroyed in the removal sequence (which
    is the normal flow).
Both of these are unnecessary complications which are much cleaner with
the ib_client interface. See Steve's commit e87a911fed07 ("nvme-rdma:
use ib_client API to detect device removal").

> In the reconnect flow the stack already repeats creating the cm_id and
> resolving address and route, so when the RDMA device comes back up, and
> assuming it will be configured with the same address and connected to the same
> network (as is the case in device reset), connections will be restored
> automatically.

As I said, I think that the problem is the interface of RDMA device
resets. IMO, device removal means we need to delete all the nvme
controllers associated with the device.

If we were to handle hotplug events where devices come into the system,
the correct way would be to send a udev event to userspace and not keep
stale controllers around in the hope they will come back. userspace is
a much better place to keep state with respect to these scenarios IMO.

> At the target stack things are even worse. When the RDMA device resets or
> disappears the softlink between the port and the subsystem stays "hanging". It
> does not represent an active bind, and when the device will come back with the
> same address and network it will not start working (even though the softlink
> is there). This is quite confusing to the user.

Right, I think we would need to reflect port state (active/inactive)
via configfs, and nvmetcli could reflect it in its UI.

> What I suggest here is to implement something similar to the reconnect flow at
> the host, and repeat the flow that is doing the rdma_bind_addr. This way,
> again, when the device will come back with the same address and network the
> bind will succeed and the subsystem will become functional again. In this case
> it makes sense to keep the softlink during all this time, as the stack really
> tries to re-bind to the port.

I'm sorry but I don't think that is the correct approach. If the device
is removed then we break the association and do nothing else. As for
RDMA device resets, this goes back to the interface problem I pointed
out.

> These changes also clean the code as RDMA_CM applications should not be
> registering as ib_clients in the first place...

I don't think that there is a problem in rdma_cm applications
registering to the ib_client API.
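Point (2) above — the cm_id must be destroyed by the rdma_cm core on the
handler's behalf, signalled by a non-zero return, because calling
rdma_destroy_id() from the handler context would block — can be illustrated
with a toy dispatcher. This is plain Python modelling the ownership rule, not
the kernel API; the Ctrl class and event names are made up.

```python
class Ctrl:
    """Stand-in for a ULP controller that owns several cm_ids."""
    def __init__(self):
        self.torn_down = False

def make_removal_handler(ctrl):
    def handler(cm_id):
        # The event fires once per cm_id, so only the first one does real
        # work; by the time the remaining cm_ids see DEVICE_REMOVAL the
        # controller teardown is already in flight (the "rest are nops"
        # awkwardness noted above).
        if not ctrl.torn_down:
            ctrl.torn_down = True
        # Returning non-zero asks the core to destroy this cm_id for us,
        # since rdma_destroy_id() would block from handler context.
        return 1
    return handler

def deliver_device_removal(cm_ids, handler):
    """Model of rdma_cm's DEVICE_REMOVAL delivery: any cm_id whose handler
    returns non-zero is destroyed by the core, not by the ULP."""
    destroyed = []
    for cm_id in list(cm_ids):
        if handler(cm_id) != 0:
            cm_ids.remove(cm_id)   # "core" destroys the id for the ULP
            destroyed.append(cm_id)
    return destroyed
```

With the ib_client interface there is a single .remove_one callback per
device instead, so neither the per-cm_id fan-out nor the special destroy
convention is needed.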
* Reconnect on RDMA device reset
  2018-01-23 12:42 ` Sagi Grimberg
@ 2018-01-24  7:41   ` Oren Duer
  2018-01-24 20:52     ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Oren Duer @ 2018-01-24 7:41 UTC (permalink / raw)

On Tue, Jan 23, 2018 at 2:42 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hi,
>
> Hey Oren,
>
>> Today host and target stacks will respond to RDMA device reset (or plug out
>> and plug in) by cleaning all resources related to that device, and sitting
>> idle waiting for administrator intervention to reconnect (host stack) or
>> rebind subsystem to a port (target stack).
>>
>> I'm thinking that maybe the right behaviour should be to try and restore
>> everything as soon as the device becomes available again. I don't think a
>> device reset should look different to the users than ports going down and up
>> again.
>
> Hmm, not sure I fully agree here. In my mind device removal means the
> device is going away which means there is no point in keeping the
> controller around...

The same could have been said of a port going down. You don't know if it
will come back up connected to the same network...

> AFAIK device resets usually are expected to quiesce inflight I/O,
> cleanup resources and restore when the reset sequence completes (which is
> what we do in nvme controller resets). I'm not sure I understand why
> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
> rdma_cm or .remove_one via ib_client API). I think the correct interface
> would be suspend/resume semantics for RDMA device resets (similar to pm
> interface).
>
> I think that it would make a much cleaner semantics and ULPs should be
> able to understand exactly what to do (which is what you suggested
> above).
>
> CCing linux-rdma.

Maybe so. I don't know what's the "standard" here for Linux in general
and networking devices in particular. Let's see if linux-rdma agrees
here.

> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
> reconnects because we have dma mappings we need to unmap for each
> request in the tagset which we don't teardown in every reconnect (as
> we may have inflight I/O). We could have theoretically use reinit_tagset
> to do that though.

Obviously it isn't that simple... Just trying to agree on the right
direction to go.

>> In the reconnect flow the stack already repeats creating the cm_id and
>> resolving address and route, so when the RDMA device comes back up, and
>> assuming it will be configured with the same address and connected to the same
>> network (as is the case in device reset), connections will be restored
>> automatically.
>
> As I said, I think that the problem is the interface of RDMA device
> resets. IMO, device removal means we need to delete all the nvme
> controllers associated with the device.

Do you think all associated controllers should be deleted when a TCP
socket gets disconnected in NVMe-over-TCP? Do they?

> If we were to handle hotplug events where devices come into the system,
> the correct way would be to send a udev event to userspace and not keep
> stale controllers around with hope they will come back. userspace is a
> much better place to keep a state with respect to these scenarios IMO.

That's the important part: I'm trying to understand the direction we
should go. First, let's agree that the user (admin) expects a simple
behaviour: if a configuration was made to connect with a remote storage,
the stack (driver, daemons, scripts) should make an effort to keep those
connections whenever possible.

Yes, it could be a userspace script/daemon job. But I was under the
impression that this group tries to consolidate most (all?) of the
functionality into the driver, and not rely on userspace daemons. Maybe
a lesson learnt from iSCSI? If all are in agreement that this should be
done in userspace, that's fine.

>> At the target stack things are even worse. When the RDMA device resets or
>> disappears the softlink between the port and the subsystem stays "hanging". It
>> does not represent an active bind, and when the device will come back with the
>> same address and network it will not start working (even though the softlink
>> is there). This is quite confusing to the user.
>
> Right, I think we would need to reflect port state (active/inactive) via
> configfs and nvmetcli could reflect it in its UI.

You mean the softlink should disappear in this case?
It can't stay as it means nothing (the bond between the port and the
subsystem is gone forever the way it is now). But removing the softlink
in configfs sounds against the nature of things: the admin put it there,
and it reflects the admin's wish to expose a subsystem via a port. This
wish did not change... Are there examples of configfs items being
changed by the stack against the admin's wish?

>> What I suggest here is to implement something similar to the reconnect flow at
>> the host, and repeat the flow that is doing the rdma_bind_addr. This way,
>> again, when the device will come back with the same address and network the
>> bind will succeed and the subsystem will become functional again. In this case
>> it makes sense to keep the softlink during all this time, as the stack really
>> tries to re-bind to the port.
>
> I'm sorry but I don't think that is the correct approach. If the device
> is removed then we break the association and do nothing else. As for
> RDMA device resets, this goes back to the interface problem I pointed
> out.

Are we in agreement that the user (admin) expects the software stack to
keep this binding when possible (like keeping the connections in the
initiator case)? After all, the admin has specifically put the softlink
there - it expresses the admin's wish.

We could agree here too that it is the task of a userspace daemon/script.
But then we'll need to keep the entire configuration in another place
(meaning configfs alone is not enough anymore), constantly compare it to
the current configuration in configfs, and make the adjustments. And
we'll need the stack to remove the symlink, which I still think is odd
behaviour.

Oren
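The "keep the configuration in another place and constantly compare it"
burden described above is essentially a reconcile step. The core of such a
hypothetical userspace daemon could look like this toy function (all names
invented; a real daemon would read configfs and its own saved configuration,
and would be driven by events rather than polling):

```python
def reconcile(desired_links, configfs_links):
    """Return the port->subsystem softlinks the daemon needs to re-create.

    desired_links  - the admin's saved intent, kept outside configfs
    configfs_links - what is currently active under configfs

    Each link is a (port, subsystem_nqn) pair. Links that exist in the
    desired set but not in configfs (e.g. because a device reset dropped
    the bind) are the ones to re-bind."""
    return sorted(set(desired_links) - set(configfs_links))
```

This is exactly the duplication Oren objects to: the desired state now lives
in two places, and the daemon's whole job is computing their difference.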
* Reconnect on RDMA device reset
  2018-01-24  7:41 ` Oren Duer
@ 2018-01-24 20:52   ` Sagi Grimberg
  2018-01-25 14:10     ` Oren Duer
  2018-01-25 18:13     ` Doug Ledford
  0 siblings, 2 replies; 19+ messages in thread
From: Sagi Grimberg @ 2018-01-24 20:52 UTC (permalink / raw)

>>> Today host and target stacks will respond to RDMA device reset (or plug out
>>> and plug in) by cleaning all resources related to that device, and sitting
>>> idle waiting for administrator intervention to reconnect (host stack) or
>>> rebind subsystem to a port (target stack).
>>>
>>> I'm thinking that maybe the right behaviour should be to try and restore
>>> everything as soon as the device becomes available again. I don't think a
>>> device reset should look different to the users than ports going down and up
>>> again.
>>
>> Hmm, not sure I fully agree here. In my mind device removal means the
>> device is going away which means there is no point in keeping the
>> controller around...
>
> The same could have been said on a port going down. You don't know if it will
> come back up connected to the same network...

That's true. However in my mind port events are considered transient,
and we do give up at some point. I'm simply arguing that device removal
has different semantics. I don't argue that we need to support it.

>> AFAIK device resets usually are expected to quiesce inflight I/O,
>> cleanup resources and restore when the reset sequence completes (which is
>> what we do in nvme controller resets). I'm not sure I understand why
>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>> would be suspend/resume semantics for RDMA device resets (similar to pm
>> interface).
>>
>> I think that it would make a much cleaner semantics and ULPs should be
>> able to understand exactly what to do (which is what you suggested
>> above).
>>
>> CCing linux-rdma.
>
> Maybe so. I don't know what's the "standard" here for Linux in general and
> networking devices in particular. Let's see if linux-rdma agree here.

I would like to hear more opinions on the current interface.

>> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
>> reconnects because we have dma mappings we need to unmap for each
>> request in the tagset which we don't teardown in every reconnect (as
>> we may have inflight I/O). We could have theoretically use reinit_tagset
>> to do that though.
>
> Obviously it isn't that simple... Just trying to agree on the right direction
> to go.

Yea, I agree. It shouldn't be too hard either.

>>> In the reconnect flow the stack already repeats creating the cm_id and
>>> resolving address and route, so when the RDMA device comes back up, and
>>> assuming it will be configured with the same address and connected to the same
>>> network (as is the case in device reset), connections will be restored
>>> automatically.
>>
>> As I said, I think that the problem is the interface of RDMA device
>> resets. IMO, device removal means we need to delete all the nvme
>> controllers associated with the device.
>
> Do you think all associated controllers should be deleted when a TCP socket
> gets disconnected in NVMe-over-TCP? Do they?

Nope, but that is equivalent to a QP going into error state IMO, and we
don't do that in nvme-rdma as well. There is a slight difference, as
tcp controllers are not responsible for releasing any HW resource nor
standing in the way of the device resetting itself. In RDMA, the ULP
needs to cooperate with the stack, so I think it would be better if the
interface mapped better to a reset process (i.e. transient).

>> If we were to handle hotplug events where devices come into the system,
>> the correct way would be to send a udev event to userspace and not keep
>> stale controllers around with hope they will come back. userspace is a
>> much better place to keep a state with respect to these scenarios IMO.
>
> That's the important part I'm trying to understand the direction we should go.
> First, let's agree that the user (admin) expects a simple behaviour:

No arguments here..

> if a configuration was made to connect with a remote storage, the stack (driver,
> daemons, scripts) should make an effort to keep those connections whenever
> possible.

True, and in fact Johannes suggested a related topic for LSF:
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015159.html

For now, we don't have a good way to auto-connect (or auto-reconnect)
for IP based nvme transports.

> Yes, it could be a userspace script/daemon job. But I was under the impression
> that this group tries to consolidate most (all?) of the functionality into the
> driver, and not rely on userspace daemons. Maybe a lesson learnt from iSCSI?

Indeed that is a guideline that was taken early on. But
auto-connect/auto-discovery is not something I think we'd like to
implement in the kernel...

> You mean the softlink should disappear in this case?
> It can't stay as it means nothing (the bond between the port and the subsystem
> is gone forever the way it is now).

I meant that we expose a port state via configfs. As for device
hotplug, maybe the individual transports can propagate a udev event to
userspace to try to re-enable the port or something... Don't have it
all figured out..

>>> What I suggest here is to implement something similar to the reconnect flow at
>>> the host, and repeat the flow that is doing the rdma_bind_addr. This way,
>>> again, when the device will come back with the same address and network the
>>> bind will succeed and the subsystem will become functional again. In this case
>>> it makes sense to keep the softlink during all this time, as the stack really
>>> tries to re-bind to the port.
>>
>> I'm sorry but I don't think that is the correct approach. If the device
>> is removed then we break the association and do nothing else. As for
>> RDMA device resets, this goes back to the interface problem I pointed
>> out.
>
> Are we in agreement that the user (admin) expects the software stack to keep
> this bound when possible (like keeping the connections in the initiator case)?
> After all, the admin has specifically put the softlink there - it expresses
> the admin's wish.

It will be the case if the port binds to INADDR_ANY :)

Anyways, I think we agree here (at least partially). I think that we
need to reflect port state in configfs (nvmetcli can color it red), and
when a device completes its reset sequence we get an event that tells
us just that; we send it to userspace and re-enable the port...

> We could agree here too that it is the task of a userspace daemon/script. But
> then we'll need to keep the entire configuration in another place (meaning
> configfs alone is not enough anymore),

We have nvmetcli for that. We just need a reactor to udev.

> constantly compare it to the current configuration in configfs, and make the
> adjustments.

I would say that we should have it driven by changes from the kernel...

> And we'll need the stack to remove the symlink, which I still think is an odd
> behaviour.

No need to remove the symlink.
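The "port events are transient, and we do give up at some point" behaviour is
a bounded retry loop. A toy model of its schedule, with parameter names
borrowed from nvme's reconnect_delay / ctrl_loss_tmo connect options (the
analogy is an assumption of this sketch, not a quote of the driver):

```python
def reconnect_schedule(reconnect_delay, ctrl_loss_tmo):
    """Yield the retry instants (in seconds) of a bounded reconnect loop:
    retry every reconnect_delay seconds, and give up once ctrl_loss_tmo
    seconds have elapsed. A negative ctrl_loss_tmo means never give up,
    which yields an infinite schedule."""
    t = reconnect_delay
    while ctrl_loss_tmo < 0 or t <= ctrl_loss_tmo:
        yield t
        t += reconnect_delay
```

A device reset folded into this flow would just be one more transient error
consuming part of the loss-timeout budget; permanent removal is the case that
falls outside it.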
* Reconnect on RDMA device reset
  2018-01-24 20:52 ` Sagi Grimberg
@ 2018-01-25 14:10   ` Oren Duer
  2018-01-29 19:58     ` Sagi Grimberg
  2018-01-25 18:13   ` Doug Ledford
  1 sibling, 1 reply; 19+ messages in thread
From: Oren Duer @ 2018-01-25 14:10 UTC (permalink / raw)

>> Yes, it could be a userspace script/daemon job. But I was under the impression
>> that this group tries to consolidate most (all?) of the functionality into the
>> driver, and not rely on userspace daemons. Maybe a lesson learnt from iSCSI?
>
> Indeed that is a guideline that was taken early on. But
> auto-connect/auto-discovery is not something I think we'd like to
> implement in the kernel...

So to summarize your view on this topic:
* Agreed that the user experience should be such that after an RDMA
  device reset everything should continue as before without admin
  intervention.
* Both for host and target stacks.
* This should not be handled by the driver itself.
* A userspace daemon that listens on udev events should retry the
  reconnects or re-binds using configfs.
* Reconnect on RDMA device reset
  2018-01-25 14:10 ` Oren Duer
@ 2018-01-29 19:58   ` Sagi Grimberg
  0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2018-01-29 19:58 UTC (permalink / raw)

> So to summarize your view on this topic:
> * Agreed that the user experience should be such that after an RDMA
>   device reset everything should continue as before without admin
>   intervention.

Yes

> * Both for host and target stacks.

Yes

> * This should not be handled by the driver itself.
> * A userspace daemon that listens on udev events should retry the
>   reconnects or re-binds using configfs.

No, I'm saying transient errors should be handled by the driver itself;
recovery from permanent errors (like unplug+replug) should be handled
outside the kernel.
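The split of responsibilities in that last answer can be stated as a tiny
policy table: transient errors are retried in the driver, permanent removal
is surfaced to userspace (e.g. via a udev event). This is a sketch with
invented event names, not kernel code; in particular, "device_reset" is
classified as transient only under the suspend/resume-style interface argued
for earlier in the thread.

```python
def recovery_policy(event):
    """Map a fabric/device event to who recovers from it.

    Transient errors (bounded retries in the driver) vs. permanent
    events (notify userspace, which may re-connect or re-bind via
    configfs). Unknown events are rejected rather than guessed at."""
    transient = {"port_down", "qp_error", "device_reset"}
    permanent = {"device_removal"}
    if event in transient:
        return "kernel_reconnect"
    if event in permanent:
        return "udev_notify_userspace"
    raise ValueError(f"unknown event: {event}")
```

The unresolved question in the thread is precisely which bucket an RDMA
device reset lands in, since today it reaches ULPs dressed up as a removal.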
* Reconnect on RDMA device reset 2018-01-24 20:52 ` Sagi Grimberg 2018-01-25 14:10 ` Oren Duer @ 2018-01-25 18:13 ` Doug Ledford 2018-01-25 19:06 ` Chuck Lever ` (2 more replies) 1 sibling, 3 replies; 19+ messages in thread From: Doug Ledford @ 2018-01-25 18:13 UTC (permalink / raw) On Wed, 2018-01-24@22:52 +0200, Sagi Grimberg wrote: > > > > Today host and target stacks will respond to RDMA device reset (or plug > > > > out > > > > and plug in) by cleaning all resources related to that device, and sitting > > > > idle waiting for administrator intervention to reconnect (host stack) or > > > > rebind subsystem to a port (target stack). > > > > > > > > I'm thinking that maybe the right behaviour should be to try and restore > > > > everything as soon as the device becomes available again. I don't think a > > > > device reset should look different to the users than ports going down and > > > > up > > > > again. > > > > > > > > > Hmm, not sure I fully agree here. In my mind device removal means the > > > device is going away which means there is no point in keeping the controller > > > around... > > > > The same could have been said on a port going down. You don't know if it will > > come back up connected to the same network... > > That's true. However in my mind port events are considered transient, > and we do give up at some point. I'm simply arguing that device removal > has different semantics. I don't argue that we need to support it. I think it depends on how you view yourself (meaning the target or initiator stacks). It's my understanding that if device eth0 disappeared completely, and then device eth1 was plugged in, and eth1 got the same ip address as eth0, then as long as any TCP sockets hadn't gone into reset state, the iSCSI devices across the existing connection would simply keep working. This is correct, yes? If so, then maybe you want iSER at least to operate the same way. 
The problem, of course, is that iSER may use the IP address and ports for connection, but then it transitions to queue pairs for data transfer. Because iSER does that, it is sitting at the same level as, say, the net core that *did* know about the eth change in the above example and transitioned the TCP socket from the old device to the new, meaning that iSER now has to take that same responsibility on itself if it wishes the user visible behavior of iSER devices to be the same as iSCSI devices. And that would even be true if the old RDMA device went away and a new RDMA device came up with the old IP address, so the less drastic form of bouncing the existing device should certainly fall under the same umbrella. I *think* for SRP this is already the case. The SRP target uses the kernel LIO framework, so if you bounce the device under the SRPt layer, doesn't the config get preserved? So that when the device came back up, the LIO configuration would still be there and the SRPt driver would see that? Bart? For the SRP client, I'm almost certain it will try to reconnect since it uses a user space daemon with a shell script that restarts the daemon on various events. That might have changed...didn't we just take a patch to rdma-core to drop the shell script? It might not reconnect automatically with the latest rdma-core, I'd have to check. Bart should know though... I haven't the faintest clue on NVMe over fabrics though. But, again, I think that's up to you guys to decide what semantics you want. With iSER it's a little easier since you can use the TCP semantics as a guideline and you have an IP/port discovery so it doesn't even have to be the same controller that comes back. With SRP it must be the same controller that comes back or else your login information will be all wrong (well, we did just take RDMA_CM support patches for SRP that will allow IP/port addressing instead, so theoretically it could now do the same thing if you are using RDMA_CM mode logins). 
I don't know the details of the NVMe addressing though. > > > AFAIK device resets usually are expected to quiesce inflight I/O, > > > cleanup resources and restore when the reset sequence completes (which is > > > what we do in nvme controller resets). I think your perspective here might be a bit skewed by the way the NVMe stack is implemented (which was intentional for speed as I understand it). As a differing example, in the SCSI stack when the LLD does a SCSI host reset, it resets the host but does not restore or restart any commands that were aborted. It is up to the upper layer SCSI drivers to do so (if they chose, they might send it back to the block layer). From the way you wrote the above, it sounds like the NVMe layer is almost monolithic in nature with no separation between upper level consumer layer and lower level driver layer, and so you can reset/restart all internally. I would argue that's rare in the linux kernel and most places the low level driver resets, and some other upper layer has to restart things if it wants or error out if it doesn't. > > > I'm not sure I understand why > > > RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via > > > rdma_cm or .remove_one via ib_client API). I think the correct interface > > > would be suspend/resume semantics for RDMA device resets (similar to pm > > > interface). No, we can't do this. Suspend/Resume is not the right model for an RDMA device reset. An RDMA device reset is a hard action that stops all ongoing DMA regardless of its source. Those sources include kernel layer consumers, user space consumers acting without the kernel's direct intervention, and ongoing DMA with remote RDMA peers (which will throw the remote queue pairs into an error state almost immediately). In the future it very likely could include RDMA between things like GPU offload processors too. We can't restart that stuff even if we wanted to. So suspend/resume semantics for an RDMA device level reset is a non- starter. 
> > > I think that it would make a much cleaner semantics and ULPs should be
> > > able to understand exactly what to do (which is what you suggested above).
> > >
> > > CCing linux-rdma.
> >
> > Maybe so. I don't know what the "standard" is here for Linux in general and networking devices in particular. Let's see if linux-rdma agrees here.
>
> I would like to hear more opinions on the current interface.

There is a difference between an RDMA device and other network devices. The net stack is much more like the SCSI stack in that you have an upper layer connection (socket or otherwise), a lower layer transport, and the net core code, which is free to move your upper layer abstraction from one lower layer transport to another. With the RDMA subsystem, your upper layer is connecting directly into the low level hardware. If you want a semantic that includes reconnection on an event, then it has to be handled in your upper layer, as there is no intervening middle layer to abstract out the task of moving your connection from one low level device to another. (That's not to say we couldn't create one, and several actually already exist, like SMC-R and RDS, but direct hooks into the core ib stack are not abstracted out and you are talking directly to the hardware.) And if you want to support moving your connection from an old removed device to a new replacement device that is not simply the same physical device being plugged back in, then you need an addressing scheme that doesn't rely on the link layer hardware address of the device.

> > > Regardless of ib_client vs. rdma_cm, we can't simply perform normal
> > > reconnects because we have dma mappings we need to unmap for each
> > > request in the tagset which we don't tear down in every reconnect (as
> > > we may have inflight I/O). We could theoretically have used reinit_tagset
> > > to do that though.
> >
> > Obviously it isn't that simple... Just trying to agree on the right direction to go.
>
> Yea, I agree. It shouldn't be too hard, either.

> > > > In the reconnect flow the stack already repeats creating the cm_id and
> > > > resolving address and route, so when the RDMA device comes back up, and
> > > > assuming it will be configured with the same address and connected to the
> > > > same network (as is the case in device reset), connections will be restored
> > > > automatically.
> > >
> > > As I said, I think that the problem is the interface of RDMA device
> > > resets. IMO, device removal means we need to delete all the nvme
> > > controllers associated with the device.
> >
> > Do you think all associated controllers should be deleted when a TCP socket
> > gets disconnected in NVMe-over-TCP? Do they?
>
> Nope, but that is equivalent to a QP going into error state IMO, and we
> don't do that in nvme-rdma as well.

There is no equivalent in the TCP realm of an RDMA controller reset or an RDMA controller permanent removal event. When dealing with TCP, if the underlying ethernet device is reset, you *might* get a TCP socket reset, you might not. If the underlying ethernet is removed, you might get a socket reset, you might not, depending on how the route to the remote host is re-established. If all IP capable devices in the entire system are removed, your TCP socket will get a reset, and attempts to reconnect will get an error.

None of those sound semantically comparable to RDMA device unplug/replug. Again, that's just because the net core never percolates that up to the TCP layer.

When you have a driver that has both TCP and RDMA transports, the truth is you are plugging into two very different levels of the kernel, and the work you have to do to support one is very different from the other. I don't think it's worthwhile to even talk about trying to treat them equivalently unless you want to take on an addressing scheme and reset/restart capability on the RDMA side of things that you don't have to have on the TCP side.

As a user of things like iSER/SRP/NVMe, I would personally like connections to persist across non-fatal events. But the RDMA stack, as it stands, can't reconnect things for you; you would have to do that in your own code.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
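The "handled in your upper layer" approach Doug describes can be reduced to a small policy decision keyed off the connection-manager event. The sketch below uses local stand-in enums (mirroring, but not using, the kernel's RDMA_CM_EVENT_* values); it only illustrates the decision of routing DEVICE_REMOVAL into the existing reconnect flow rather than tearing the controller down, not the actual verbs plumbing:

```c
/* Hypothetical ULP-side event policy. The enum values are local
 * stand-ins for rdma_cm events; nothing here touches real verbs. */
enum ulp_event { EV_ADDR_ERROR, EV_DISCONNECTED, EV_DEVICE_REMOVAL };
enum ulp_action { ACT_FAIL_CONNECT, ACT_SCHEDULE_RECONNECT, ACT_DELETE_CTRL };

/* With auto_reconnect set, DEVICE_REMOVAL feeds the same reconnect
 * path as a transport disconnect; otherwise it deletes the controller,
 * which is today's behavior. */
static enum ulp_action action_for_event(enum ulp_event ev, int auto_reconnect)
{
    switch (ev) {
    case EV_DISCONNECTED:
        return ACT_SCHEDULE_RECONNECT;
    case EV_DEVICE_REMOVAL:
        return auto_reconnect ? ACT_SCHEDULE_RECONNECT : ACT_DELETE_CTRL;
    default:
        return ACT_FAIL_CONNECT;
    }
}
```

The point of the table is that the ULP, not the core, owns the choice: the same event can mean "clean up forever" or "re-enter the reconnect state machine" depending on policy.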
* Reconnect on RDMA device reset
From: Chuck Lever @ 2018-01-25 19:06 UTC

> On Jan 25, 2018, at 10:13 AM, Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2018-01-24 at 22:52 +0200, Sagi Grimberg wrote:
>>>>> Today host and target stacks will respond to RDMA device reset (or plug out
>>>>> and plug in) by cleaning all resources related to that device, and sitting
>>>>> idle waiting for administrator intervention to reconnect (host stack) or
>>>>> rebind subsystem to a port (target stack).
>>>>>
>>>>> I'm thinking that maybe the right behaviour should be to try and restore
>>>>> everything as soon as the device becomes available again. I don't think a
>>>>> device reset should look different to the users than ports going down and up
>>>>> again.
>>>>
>>>> Hmm, not sure I fully agree here. In my mind device removal means the
>>>> device is going away which means there is no point in keeping the controller
>>>> around...
>>>
>>> The same could have been said on a port going down. You don't know if it will
>>> come back up connected to the same network...
>>
>> That's true. However in my mind port events are considered transient,
>> and we do give up at some point. I'm simply arguing that device removal
>> has different semantics. I don't argue that we need to support it.
>
> I think it depends on how you view yourself (meaning the target or initiator stacks). It's my understanding that if device eth0 disappeared completely, and then device eth1 was plugged in, and eth1 got the same ip address as eth0, then as long as any TCP sockets hadn't gone into reset state, the iSCSI devices across the existing connection would simply keep working. This is correct, yes?

For NFS/RDMA, I think of the "failover" case where a device is removed, then a new one is plugged in (or an existing cold replacement is made available) with the same IP configuration.

On a "hard" NFS mount, we want the upper layers to wait for a new suitable device to be made available, and then to use it to resend any pending RPCs. The workload should continue after a new device is available.

Feel free to tell me I'm full of turtles.

> If so, then maybe you want iSER at least to operate the same way. The problem, of course, is that iSER may use the IP address and ports for connection, but then it transitions to queue pairs for data transfer. Because iSER does that, it is sitting at the same level as, say, the net core that *did* know about the eth change in the above example and transitioned the TCP socket from the old device to the new, meaning that iSER now has to take that same responsibility on itself if it wishes the user visible behavior of iSER devices to be the same as iSCSI devices. And that would even be true if the old RDMA device went away and a new RDMA device came up with the old IP address, so the less drastic form of bouncing the existing device should certainly fall under the same umbrella.
>
> I *think* for SRP this is already the case. The SRP target uses the kernel LIO framework, so if you bounce the device under the SRPt layer, doesn't the config get preserved? So that when the device came back up, the LIO configuration would still be there and the SRPt driver would see that? Bart?
>
> For the SRP client, I'm almost certain it will try to reconnect since it uses a user space daemon with a shell script that restarts the daemon on various events. That might have changed... didn't we just take a patch to rdma-core to drop the shell script? It might not reconnect automatically with the latest rdma-core, I'd have to check. Bart should know though...
>
> I haven't the faintest clue on NVMe over fabrics though. But, again, I think that's up to you guys to decide what semantics you want. With iSER it's a little easier since you can use the TCP semantics as a guideline and you have IP/port discovery, so it doesn't even have to be the same controller that comes back. With SRP it must be the same controller that comes back or else your login information will be all wrong (well, we did just take RDMA_CM support patches for SRP that will allow IP/port addressing instead, so theoretically it could now do the same thing if you are using RDMA_CM mode logins). I don't know the details of the NVMe addressing though.
>
>>>> AFAIK device resets usually are expected to quiesce inflight I/O,
>>>> cleanup resources and restore when the reset sequence completes (which is
>>>> what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the NVMe stack is implemented (which was intentional for speed as I understand it). As a differing example, in the SCSI stack when the LLD does a SCSI host reset, it resets the host but does not restore or restart any commands that were aborted. It is up to the upper layer SCSI drivers to do so (if they choose, they might send it back to the block layer). From the way you wrote the above, it sounds like the NVMe layer is almost monolithic in nature with no separation between an upper level consumer layer and a lower level driver layer, and so you can reset/restart all internally. I would argue that's rare in the Linux kernel; in most places the low level driver resets, and some other upper layer has to restart things if it wants, or error out if it doesn't.
>
>>>> I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>> would be suspend/resume semantics for RDMA device resets (similar to the pm
>>>> interface).
>
> No, we can't do this. Suspend/resume is not the right model for an RDMA device reset. An RDMA device reset is a hard action that stops all ongoing DMA regardless of its source. Those sources include kernel layer consumers, user space consumers acting without the kernel's direct intervention, and ongoing DMA with remote RDMA peers (which will throw the remote queue pairs into an error state almost immediately). In the future it very likely could include RDMA between things like GPU offload processors too. We can't restart that stuff even if we wanted to. So suspend/resume semantics for an RDMA device level reset is a non-starter.
>
> [...]
>
> --
> Doug Ledford <dledford@redhat.com>

--
Chuck Lever
* Reconnect on RDMA device reset
From: Sagi Grimberg @ 2018-01-29 20:01 UTC

Hi Chuck,

> For NFS/RDMA, I think of the "failover" case where a device is
> removed, then a new one is plugged in (or an existing cold
> replacement is made available) with the same IP configuration.
>
> On a "hard" NFS mount, we want the upper layers to wait for
> a new suitable device to be made available, and then to use
> it to resend any pending RPCs. The workload should continue
> after a new device is available.

Really? So the context is held forever (in case the device never comes back)?
* Reconnect on RDMA device reset
From: Chuck Lever @ 2018-01-29 20:11 UTC

> On Jan 29, 2018, at 3:01 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hi Chuck,
>
>> For NFS/RDMA, I think of the "failover" case where a device is
>> removed, then a new one is plugged in (or an existing cold
>> replacement is made available) with the same IP configuration.
>> On a "hard" NFS mount, we want the upper layers to wait for
>> a new suitable device to be made available, and then to use
>> it to resend any pending RPCs. The workload should continue
>> after a new device is available.
>
> Really? So the context is held forever (in case the device never
> comes back)?

I didn't say this was the best approach :-) And it certainly can change if we have something better.

But yes, with a hard mount, the NFS and RPC client stack keeps the pending RPCs around and continues to attempt reconnection with the NFS server. The idea is that after an unplug, another device with the proper IP configuration can be made available, and then rdma_resolve_addr() can figure out how to reconnect. The associated NFS workload will be suspended until it can reconnect.

Now on the NFS server (target), an unplug results in connection abort. Any context at the transport layer is gone, though the NFS server maintains duplicate reply caches that can hold RPC replies for some time. Those are all bounded in size. The clients continue to attempt to reconnect until there is another device available that can allow the server to accept connections.

--
Chuck Lever
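The contrast in this exchange, "hard-mount" semantics that hold context and retry forever versus a policy that gives up after a bounded number of attempts, comes down to one predicate. A self-contained sketch with hypothetical names (no NFS/RPC or nvme internals involved):

```c
#include <stdbool.h>

/* Sketch of the two retry policies discussed: hard-mount style that
 * never gives up, vs. a bounded policy that abandons the controller
 * after max_attempts. Names are illustrative, not kernel API. */
struct reconnect_policy {
    bool hard;          /* hard-mount semantics: retry forever */
    int  max_attempts;  /* consulted only when !hard */
};

static bool should_retry(const struct reconnect_policy *p, int attempts_so_far)
{
    if (p->hard)
        return true;    /* context is held until a device returns */
    return attempts_so_far < p->max_attempts;
}
```

Today's nvme-rdma reconnect flow is effectively the bounded branch; Sagi's "held forever?" question is asking whether the hard branch is ever the right default.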
* Reconnect on RDMA device reset
From: Doug Ledford @ 2018-01-29 21:27 UTC

On Mon, 2018-01-29 at 15:11 -0500, Chuck Lever wrote:
>> On Jan 29, 2018, at 3:01 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>> Hi Chuck,
>>
>>> For NFS/RDMA, I think of the "failover" case where a device is
>>> removed, then a new one is plugged in (or an existing cold
>>> replacement is made available) with the same IP configuration.
>>> On a "hard" NFS mount, we want the upper layers to wait for
>>> a new suitable device to be made available, and then to use
>>> it to resend any pending RPCs. The workload should continue
>>> after a new device is available.
>>
>> Really? so the context is held forever (in case the device never
>> comes back)?
>
> I didn't say this was the best approach :-) And it certainly can
> change if we have something better.

Whether it's the best or not, it's the defined behavior of the "hard" mount option. So if someone doesn't want that, you don't use a hard mount ;-)

Hard mounts are great for situations where you have a high degree of faith that even if the server disappears, it will reappear soon. They suck when the server totally dies, though, because now all the hard mount clients are stuck :-/.

--
Doug Ledford <dledford@redhat.com>
* Reconnect on RDMA device reset
From: Chuck Lever @ 2018-01-29 21:46 UTC

> On Jan 29, 2018, at 4:27 PM, Doug Ledford <dledford@redhat.com> wrote:
>
> On Mon, 2018-01-29 at 15:11 -0500, Chuck Lever wrote:
>>> On Jan 29, 2018, at 3:01 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>>
>>> Hi Chuck,
>>>
>>>> For NFS/RDMA, I think of the "failover" case where a device is
>>>> removed, then a new one is plugged in (or an existing cold
>>>> replacement is made available) with the same IP configuration.
>>>> On a "hard" NFS mount, we want the upper layers to wait for
>>>> a new suitable device to be made available, and then to use
>>>> it to resend any pending RPCs. The workload should continue
>>>> after a new device is available.
>>>
>>> Really? so the context is held forever (in case the device never
>>> comes back)?
>>
>> I didn't say this was the best approach :-) And it certainly can
>> change if we have something better.
>
> Whether it's the best or not, it's the defined behavior of the "hard"
> mount option. So if someone doesn't want that, you don't use a hard
> mount ;-)
>
> Hard mounts are great for situations where you have a high degree of
> faith that even if the server disappears, it will reappear soon. They
> suck when the server totally dies, though, because now all the hard mount
> clients are stuck :-/.

We're working on fixing that.

--
Chuck Lever
* Reconnect on RDMA device reset
From: Jason Gunthorpe @ 2018-01-25 22:48 UTC

On Thu, Jan 25, 2018 at 01:13:42PM -0500, Doug Ledford wrote:
> various events. That might have changed... didn't we just take a patch
> to rdma-core to drop the shell script? It might not reconnect
> automatically with the latest rdma-core, I'd have to check. Bart should
> know though...

We dropped the shell script in favor of udev. srp_daemon will now stop when rdma devices are removed and restart again when they are added.

Jason
* Reconnect on RDMA device reset
From: Sagi Grimberg @ 2018-01-29 20:36 UTC

> I *think* for SRP this is already the case. The SRP target uses the
> kernel LIO framework, so if you bounce the device under the SRPt layer,
> doesn't the config get preserved? So that when the device came back up,
> the LIO configuration would still be there and the SRPt driver would see
> that? Bart?

I think you're right. I think we can do that if we keep the listener cm_id device node_guid, and when a new device comes in we can see if we have a cm listener on that device and re-listen. That is a good idea, Doug.

> For the SRP client, I'm almost certain it will try to reconnect since it
> uses a user space daemon with a shell script that restarts the daemon on
> various events. That might have changed... didn't we just take a patch
> to rdma-core to drop the shell script? It might not reconnect
> automatically with the latest rdma-core, I'd have to check. Bart should
> know though...

The srp driver relies on srp_daemon to discover and connect again over the new device. iSER relies on iscsiadm to reconnect. I guess that would be the correct approach for nvme as well (which we don't have at the moment)...

>>>> AFAIK device resets usually are expected to quiesce inflight I/O,
>>>> cleanup resources and restore when the reset sequence completes (which is
>>>> what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the NVMe stack is implemented (which was intentional for speed as I understand it). As a differing example, in the SCSI stack when the LLD does a SCSI host reset, it resets the host but does not restore or restart any commands that were aborted. It is up to the upper layer SCSI drivers to do so (if they choose, they might send it back to the block layer). From the way you wrote the above, it sounds like the NVMe layer is almost monolithic in nature with no separation between an upper level consumer layer and a lower level driver layer, and so you can reset/restart all internally. I would argue that's rare in the Linux kernel; in most places the low level driver resets, and some other upper layer has to restart things if it wants, or error out if it doesn't.

That is the case for nvme as well, but I was merely saying that a device reset is not really a device removal. And this makes it hard for the ULP to understand what to do (or for me at least...)

>>>> I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>> would be suspend/resume semantics for RDMA device resets (similar to the pm
>>>> interface).
>
> No, we can't do this. Suspend/resume is not the right model for an RDMA
> device reset. An RDMA device reset is a hard action that stops all
> ongoing DMA regardless of its source.

Suspend also requires that.

> Those sources include kernel
> layer consumers, user space consumers acting without the kernel's direct
> intervention, and ongoing DMA with remote RDMA peers (which will throw
> the remote queue pairs into an error state almost immediately). In the
> future it very likely could include RDMA between things like GPU offload
> processors too. We can't restart that stuff even if we wanted to. So
> suspend/resume semantics for an RDMA device level reset is a non-starter.

I see. I can understand the argument "we are stuck with what we have" for user-space, but does that mandate that we must live with that for kernel consumers as well? Even if the semantics are confusing? (Just asking, it's only my opinion :))

>>>> I think that it would make a much cleaner semantics and ULPs should be
>>>> able to understand exactly what to do (which is what you suggested
>>>> above).
>>>>
>>>> CCing linux-rdma.
>>>
>>> Maybe so. I don't know what the "standard" is here for Linux in general and
>>> networking devices in particular. Let's see if linux-rdma agrees here.
>>
>> I would like to hear more opinions on the current interface.
>
> There is a difference between an RDMA device and other network devices. The net stack is much more like the SCSI stack in that you have an upper layer connection (socket or otherwise), a lower layer transport, and the net core code, which is free to move your upper layer abstraction from one lower layer transport to another. With the RDMA subsystem, your upper layer is connecting directly into the low level hardware. If you want a semantic that includes reconnection on an event, then it has to be handled in your upper layer, as there is no intervening middle layer to abstract out the task of moving your connection from one low level device to another (that's not to say we couldn't create one, and several actually already exist, like SMC-R and RDS, but direct hooks into the core ib stack are not abstracted out and you are talking directly to the hardware). And if you want to support moving your connection from an old removed device to a new replacement device that is not simply the same physical device being plugged back in, then you need an addressing scheme that doesn't rely on the link layer hardware address of the device.

Actually, I didn't suggest that at all. I fully agree that the ULP needs to cooperate with the core and the HW, as it is holding physical resources. All I suggested is that the core would reflect that the device is resetting, and not reflect that the device is going away and that afterwards a new device comes in that happens to be the same device...

> As a user of things like iSER/SRP/NVMe, I would personally like
> connections to persist across non-fatal events. But the RDMA stack, as
> it stands, can't reconnect things for you, you would have to do that in
> your own code.

Again, I fully agree. I didn't mean that the core would handle everything for the consumer of the device at all. I just think that the interface can improve such that the consumer's life (and code) would be easier.
* Reconnect on RDMA device reset
From: Bart Van Assche @ 2018-01-29 21:34 UTC

On Mon, 2018-01-29 at 22:36 +0200, Sagi Grimberg wrote:
>> I *think* for SRP this is already the case. The SRP target uses the
>> kernel LIO framework, so if you bounce the device under the SRPt layer,
>> doesn't the config get preserved? So that when the device came back up,
>> the LIO configuration would still be there and the SRPt driver would see
>> that? Bart?
>
> I think you're right. I think we can do that if we keep the listener
> cm_id device node_guid and when a new device comes in we can see if we
> have a cm listener on that device and re-listen. That is a good idea
> Doug.

Sorry that I hadn't noticed this e-mail thread earlier and had not yet replied. The SRPT config should get preserved as long as the device removal function (srpt_remove_one()) does not get called.

>> For the SRP client, I'm almost certain it will try to reconnect since it
>> uses a user space daemon with a shell script that restarts the daemon on
>> various events. That might have changed... didn't we just take a patch
>> to rdma-core to drop the shell script? It might not reconnect
>> automatically with the latest rdma-core, I'd have to check. Bart should
>> know though...
>
> The srp driver relies on srp_daemon to discover and connect again over the
> new device. iSER relies on iscsiadm to reconnect. I guess that would be
> the correct approach for nvme as well (which we don't have at the
> moment)...

There are two mechanisms for the SRP initiator to make it reconnect to an SRP target:

1. srp_daemon. Even with the latest rdma-core changes, srp_daemon should still discover SRP targets and reconnect to the target systems it is allowed to reconnect to by its configuration file.

2. The reconnection mechanism in the SCSI SRP transport layer. See also the documentation of reconnect_delay in https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-transport-srp

Bart.
* Reconnect on RDMA device reset
From: Doug Ledford @ 2018-01-29 22:28 UTC

On Mon, 2018-01-29 at 22:36 +0200, Sagi Grimberg wrote:
>
> That is the case for nvme as well, but I was merely saying that a device
> reset is not really a device removal. And this makes it hard for the ULP
> to understand what to do (or for me at least...)

OK, I get that the difference between the two is making it hard to understand what to do. But the truth of the issue is that whether you are doing a reset or a remove/add cycle, what *your* code needs to do doesn't change. For both cases, your code must A) drop everything on the floor like a hot potato and B) restart from scratch. The only thing that's confusing you is that it's more or less assumed on a reset that you would auto-restart, whereas it isn't so clear that you would want to do the same on a remove/add cycle. I think the answer to your question is: if the same device comes back that went away, then yes, auto-restart would seem appropriate. If you make that policy decision, then the *only* difference between device reset and device hot-replug is that you actually have to verify that the same device came back as went away.

As an optional item, you could start a timer when the device disappears, and if it takes more than, say, 10 minutes to reappear, you could cancel the auto-restart on the basis that someone probably physically unplugged and replugged the card and they might not want that. But really, aside from the fact that the hot plug flow needs you to check that the same device comes back, reset and hot plug have the exact same requirements/needs and can be serviced by a single code path.

>>>>> I'm not sure I understand why
>>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>>> would be suspend/resume semantics for RDMA device resets (similar to the pm
>>>>> interface).
>>
>> No, we can't do this. Suspend/resume is not the right model for an RDMA
>> device reset. An RDMA device reset is a hard action that stops all
>> ongoing DMA regardless of its source.
>
> Suspend also requires that.

But suspend has a semantic of "local to this machine" and usually at least attempts to stop gracefully. Because RDMA allows for things such as a remote machine doing an RDMA READ when we suspend, we can't even attempt the normal graceful shutdown and are left with only the nuclear reset option. In addition, if you reset a network card, the network card's registers don't disappear, and your PCI MMIO region doesn't go away. When you reset an RDMA adapter, all of the allocated memory regions for card communications that have been handed out to kernel space, user space, etc. *do* disappear. That isn't really like the suspend semantic. You don't have the option of cleanly stopping things and quiescing the system prior to suspend, because your basic communication channel is gone already. From this point of view, the hot remove semantic is very fitting. The entire card didn't get hot removed, but certainly all of those allocated communication channels very well did.

>> Those sources include kernel
>> layer consumers, user space consumers acting without the kernel's direct
>> intervention, and ongoing DMA with remote RDMA peers (which will throw
>> the remote queue pairs into an error state almost immediately). In the
>> future it very likely could include RDMA between things like GPU offload
>> processors too. We can't restart that stuff even if we wanted to. So
>> suspend/resume semantics for an RDMA device level reset is a non-starter.
>
> I see. I can understand the argument "we are stuck with what we have"
> for user-space, but does that mandate that we must live with that for
> kernel consumers as well? Even if the semantics are confusing? (Just
> asking, it's only my opinion :))

See above. It's not about user versus kernel space, it's that we really did hot-remove a bunch of resources, even if not the card itself.

--
Doug Ledford <dledford@redhat.com>
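The policy Doug sketches in words, remember what was removed, and auto-restart only if the same device returns within a grace window, fits in a few lines. Everything below is an illustrative stand-in (local struct, a caller-supplied clock), not the ib_client API itself:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: on removal, record the device GUID and the removal time; on a
 * subsequent device-add, auto-restart only if the GUID matches and the
 * device returned within a grace window. Stand-in types only; a real
 * implementation would record these in .remove_one / check in .add_one. */
struct removed_dev {
    uint64_t node_guid;   /* GUID recorded at removal time */
    long     removed_at;  /* seconds; time source is up to the caller */
};

#define AUTO_RESTART_WINDOW_SEC (10 * 60)  /* the "say, 10 minutes" above */

static bool should_auto_restart(const struct removed_dev *old,
                                uint64_t new_guid, long now)
{
    if (old->node_guid != new_guid)
        return false;  /* a different card was plugged in */
    return (now - old->removed_at) <= AUTO_RESTART_WINDOW_SEC;
}
```

With this check in place, reset and hot-replug really do collapse into a single code path: the GUID comparison is the only extra work the replug case needs.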
* Reconnect on RDMA device reset 2018-01-29 22:28 ` Doug Ledford @ 2018-01-30 15:03 ` Oren Duer 2018-01-30 17:24 ` Doug Ledford 0 siblings, 1 reply; 19+ messages in thread From: Oren Duer @ 2018-01-30 15:03 UTC (permalink / raw) On Tue, Jan 30, 2018@12:28 AM, Doug Ledford <dledford@redhat.com> wrote: > On Mon, 2018-01-29@22:36 +0200, Sagi Grimberg wrote: >> > >> That is the case for nvme as well, but I was merely saying that device >> reset is not really a device removal. And this makes it hard for the ULP >> to understand what to do (or for me at least...) > > OK, I get that the difference between the two is making it hard to > understand what to do. But, the truth of the issue is that whether you > are doing a reset or a remove/add cycle, what *your* code needs to do > doesn't change. For both cases, your code must A) drop everything on > the floor like a hot potato and B) restart from scratch. Fully agree here. You don't want different code flow for reset. You already have remove/add flows, which require to to act the right way as Doug described. A reset is exactly remove and add of the same device. > The only thing > that's confusing you is that it's more or less assumed on a reset that > you would auto-restart, where as it isn't so clear that you would want > to do the same on a remove/add cycle. I think the answer to your > question is: if the same device comes back that went away, then yes, > auto-restart would seem appropriate. If you make that policy decision, > then the *only* difference between device reset and device hot-replug is > that you actually have to verify that the same device came back as went > away. > > As an optional item, you could start a timer when the device disappears, > and if it takes more than, say, 10 minutes to reappear, you could cancel > the auto-restart on the basis that someone probably physically unplugged > and replugged the card and they might not want that. 
> But really, aside from the fact that the hot plug flow needs you to
> check the same device comes back, reset and hot plug have the exact
> same requirements/needs and can be serviced by a single code path.

Not sure why we need to keep track of whether it is the same device or
not. I fail to understand why we trust the system admin to create the
connections at the beginning, but should not trust him anymore if the
device was removed, a new device was added in its place, and it was
configured with the same network IP/subnet. To me it looks like the
admin wanted exactly that: for the connections to be restored over the
new device.

If the admin does not want the reconnections to happen, he has the
option to explicitly request a disconnect. The same applies if the
device was removed forever.

And to help us with all that, we have rdma_cm. As long as we repeat the
rdma_connect() on the initiator side and rdma_bind/resolve_addr()... on
the target side, we'll get exactly this behaviour.

If we want this done with userspace daemons, that's fine. We'll need one
for the initiator stack and one for the target stack. And we'll probably
need some missing udev events and a configfs entry?

-- 
Oren
* Reconnect on RDMA device reset
  From: Doug Ledford @ 2018-01-30 17:24 UTC (permalink / raw)

On Tue, 2018-01-30 at 17:03 +0200, Oren Duer wrote:
> Not sure why we need to keep track whether it is the same device or not.
> I fail to understand why we trust the system admin to create the connections
> at the beginning, but we should not trust him anymore if the device was
> removed, a new device was added in place of it, and it was configured with the
> same network IP/subnet. To me it looks like the admin wanted exactly that:
> for the connections to be restored over the new device.

In my original email, I pointed out that I didn't know how NVMe over
Fabrics was doing addressing.  If it were like SRP, then it wouldn't
work because the device GUID is part of the addressing scheme, so a new
device wouldn't automatically show as matching that address.  If,
however, you use IP like iSER, then yes, I agree fully you can just test
the ability to route over the new device using the IP address and
restore if it works.

-- 
Doug Ledford <dledford@redhat.com>
* Reconnect on RDMA device reset
  From: Steve Wise @ 2018-01-30 17:51 UTC (permalink / raw)

> In my original email, I pointed out that I didn't know how NVMe over
> Fabrics was doing addressing.  If it were like SRP, then it wouldn't
> work because the device GUID is part of the addressing scheme, so a new
> device wouldn't automatically show as matching that address.  If,
> however, you use IP like iSER, then yes, I agree fully you can just test
> the ability to route over the new device using the IP address and
> restore if it works.

NVMe-oF/RDMA uses the rdma_cm with IP addresses, IP ports, etc.
Thread overview: 19+ messages

2018-01-22 15:07 Reconnect on RDMA device reset  Oren Duer
2018-01-23 12:42 ` Sagi Grimberg
2018-01-24  7:41   ` Oren Duer
2018-01-24 20:52     ` Sagi Grimberg
2018-01-25 14:10       ` Oren Duer
2018-01-29 19:58         ` Sagi Grimberg
2018-01-25 18:13       ` Doug Ledford
2018-01-25 19:06         ` Chuck Lever
2018-01-29 20:01           ` Sagi Grimberg
2018-01-29 20:11             ` Chuck Lever
2018-01-29 21:27               ` Doug Ledford
2018-01-29 21:46                 ` Chuck Lever
2018-01-25 22:48       ` Jason Gunthorpe
2018-01-29 20:36         ` Sagi Grimberg
2018-01-29 21:34           ` Bart Van Assche
2018-01-29 22:28             ` Doug Ledford
2018-01-30 15:03               ` Oren Duer
2018-01-30 17:24                 ` Doug Ledford
2018-01-30 17:51                   ` Steve Wise