* Reconnect on RDMA device reset
@ 2018-01-22 15:07 Oren Duer
  2018-01-23 12:42 ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Oren Duer @ 2018-01-22 15:07 UTC (permalink / raw)

Hi,

Today host and target stacks will respond to RDMA device reset (or plug out
and plug in) by cleaning all resources related to that device, and sitting
idle waiting for administrator intervention to reconnect (host stack) or
rebind the subsystem to a port (target stack).

I'm thinking that maybe the right behaviour should be to try and restore
everything as soon as the device becomes available again. I don't think a
device reset should look different to the users than ports going down and up
again.

At the host stack we already have a reconnect flow (which works great when
ports go down and back up). Instead of registering the ib_client callback
rdma_remove_one and cleaning up everything, we could respond to the
RDMA_CM_EVENT_DEVICE_REMOVAL event and go into that reconnect flow.

In the reconnect flow the stack already repeats creating the cm_id and
resolving address and route, so when the RDMA device comes back up, and
assuming it will be configured with the same address and connected to the
same network (as is the case in a device reset), connections will be restored
automatically.

At the target stack things are even worse. When the RDMA device resets or
disappears, the softlink between the port and the subsystem stays "hanging".
It does not represent an active bind, and when the device comes back with the
same address and network it will not start working (even though the softlink
is there). This is quite confusing to the user.

What I suggest here is to implement something similar to the reconnect flow
at the host, and repeat the flow that does the rdma_bind_addr. This way,
again, when the device comes back with the same address and network the bind
will succeed and the subsystem will become functional again. In this case it
makes sense to keep the softlink during all this time, as the stack really
tries to re-bind to the port.

These changes also clean up the code, as RDMA_CM applications should not be
registering as ib_clients in the first place...

Thoughts?

Oren Duer
Mellanox Technologies

^ permalink raw reply	[flat|nested] 19+ messages in thread
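The host-side behaviour proposed above can be sketched as a small state
machine: the only change is that DEVICE_REMOVAL feeds the same reconnect path
that a port-down event already takes, instead of tearing the controller down.
This is a user-space toy model, not actual nvme-rdma code; all names here are
invented for illustration.

```python
from enum import Enum, auto

class Event(Enum):
    ESTABLISHED = auto()
    DISCONNECTED = auto()     # port went down / peer closed the connection
    DEVICE_REMOVAL = auto()   # models RDMA_CM_EVENT_DEVICE_REMOVAL

class State(Enum):
    LIVE = auto()
    RECONNECTING = auto()
    DELETING = auto()

def handle_event(state, event, treat_removal_as_transient=True):
    """Return the next controller state for a given rdma_cm-style event.

    With treat_removal_as_transient=False this models today's behaviour
    (device removal tears the controller down for good); with True it
    models the proposal above: removal enters the existing reconnect
    flow, which keeps re-creating the cm_id and re-resolving address and
    route until the device comes back with the same address."""
    if event is Event.ESTABLISHED:
        return State.LIVE
    if event is Event.DISCONNECTED:
        return State.RECONNECTING
    if event is Event.DEVICE_REMOVAL:
        return State.RECONNECTING if treat_removal_as_transient else State.DELETING
    return state
```

Under this model a device reset looks to the user exactly like a port bounce:
the controller sits in RECONNECTING until the device reappears.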
* Reconnect on RDMA device reset
  2018-01-22 15:07 Reconnect on RDMA device reset Oren Duer
@ 2018-01-23 12:42 ` Sagi Grimberg
  2018-01-24  7:41   ` Oren Duer
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2018-01-23 12:42 UTC (permalink / raw)

> Hi,

Hey Oren,

> Today host and target stacks will respond to RDMA device reset (or plug out
> and plug in) by cleaning all resources related to that device, and sitting
> idle waiting for administrator intervention to reconnect (host stack) or
> rebind subsystem to a port (target stack).
>
> I'm thinking that maybe the right behaviour should be to try and restore
> everything as soon as the device becomes available again. I don't think a
> device reset should look different to the users than ports going down and up
> again.

Hmm, not sure I fully agree here. In my mind device removal means the
device is going away, which means there is no point in keeping the
controller around...

AFAIK device resets are usually expected to quiesce inflight I/O,
clean up resources and restore when the reset sequence completes (which
is what we do in nvme controller resets). I'm not sure I understand why
RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
rdma_cm or .remove_one via the ib_client API). I think the correct
interface would be suspend/resume semantics for RDMA device resets
(similar to the pm interface).

I think that would make for much cleaner semantics, and ULPs would be
able to understand exactly what to do (which is what you suggested
above).

CCing linux-rdma.

> At the host stack we already have a reconnect flow (which works great when
> ports go down and back up). Instead of registering to ib_client callback
> rdma_remove_one and clean up everything, we could respond to the
> RDMA_CM_EVENT_DEVICE_REMOVAL event and go into that reconnect flow.

Regardless of ib_client vs. rdma_cm, we can't simply perform normal
reconnects because we have dma mappings we need to unmap for each
request in the tagset, which we don't tear down on every reconnect (as
we may have inflight I/O). We could theoretically have used
reinit_tagset to do that, though.

Personally I think ib_client is much better than the rdma_cm
DEVICE_REMOVAL event interface because:
(1) rdma_cm is per cm_id, which means we effectively only react to the
    first event and the rest are nops, which is a bit awkward.
(2) it requires special handling for resource cleanup with respect to
    the cm_id removal: the cm_id must be destroyed within the
    DEVICE_REMOVAL event by returning a non-zero value from the event
    handler (as rdma_destroy_id() would block from the event_handler
    context), and must not be destroyed in the removal sequence (which
    is the normal flow).
Both of these are unnecessary complications which are much cleaner with
the ib_client interface. See Steve's commit e87a911fed07 ("nvme-rdma:
use ib_client API to detect device removal").

> In the reconnect flow the stack already repeats creating the cm_id and
> resolving address and route, so when the RDMA device comes back up, and
> assuming it will be configured with the same address and connected to the same
> network (as is the case in device reset), connections will be restored
> automatically.

As I said, I think that the problem is the interface of RDMA device
resets. IMO, device removal means we need to delete all the nvme
controllers associated with the device.

If we were to handle hotplug events where devices come into the system,
the correct way would be to send a udev event to userspace and not keep
stale controllers around in the hope they will come back. userspace is
a much better place to keep state with respect to these scenarios IMO.

> At the target stack things are even worse. When the RDMA device resets or
> disappears the softlink between the port and the subsystem stays "hanging". It
> does not represent an active bind, and when the device will come back with the
> same address and network it will not start working (even though the softlink
> is there). This is quite confusing to the user.

Right, I think we would need to reflect port state (active/inactive)
via configfs, and nvmetcli could reflect it in its UI.

> What I suggest here is to implement something similar to the reconnect flow at
> the host, and repeat the flow that is doing the rdma_bind_addr. This way,
> again, when the device will come back with the same address and network the
> bind will succeed and the subsystem will become functional again. In this case
> it makes sense to keep the softlink during all this time, as the stack really
> tries to re-bind to the port.

I'm sorry but I don't think that is the correct approach. If the device
is removed then we break the association and do nothing else. As for
RDMA device resets, this goes back to the interface problem I pointed
out.

> These changes also clean the code as RDMA_CM applications should not be
> registering as ib_clients in the first place...

I don't think that there is a problem in rdma_cm applications
registering to the ib_client API.
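Point (2) above — the cm_id must be destroyed by the rdma_cm core on the
handler's behalf, signalled by a non-zero return, because calling
rdma_destroy_id() from the handler context would block — can be illustrated
with a toy dispatcher. This is plain Python modelling the ownership rule, not
the kernel API; the Ctrl class and event names are made up.

```python
class Ctrl:
    """Stand-in for a ULP controller that owns several cm_ids."""
    def __init__(self):
        self.torn_down = False

def make_removal_handler(ctrl):
    def handler(cm_id):
        # The event fires once per cm_id, so only the first one does real
        # work; by the time the remaining cm_ids see DEVICE_REMOVAL the
        # controller teardown is already in flight (the "rest are nops"
        # awkwardness noted above).
        if not ctrl.torn_down:
            ctrl.torn_down = True
        # Returning non-zero asks the core to destroy this cm_id for us,
        # since rdma_destroy_id() would block from handler context.
        return 1
    return handler

def deliver_device_removal(cm_ids, handler):
    """Model of rdma_cm's DEVICE_REMOVAL delivery: any cm_id whose handler
    returns non-zero is destroyed by the core, not by the ULP."""
    destroyed = []
    for cm_id in list(cm_ids):
        if handler(cm_id) != 0:
            cm_ids.remove(cm_id)   # "core" destroys the id for the ULP
            destroyed.append(cm_id)
    return destroyed
```

With the ib_client interface there is a single .remove_one callback per
device instead, so neither the per-cm_id fan-out nor the special destroy
convention is needed.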
* Reconnect on RDMA device reset
  2018-01-23 12:42 ` Sagi Grimberg
@ 2018-01-24  7:41   ` Oren Duer
  2018-01-24 20:52     ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Oren Duer @ 2018-01-24 7:41 UTC (permalink / raw)

On Tue, Jan 23, 2018 at 2:42 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hi,
>
> Hey Oren,
>
>> Today host and target stacks will respond to RDMA device reset (or plug out
>> and plug in) by cleaning all resources related to that device, and sitting
>> idle waiting for administrator intervention to reconnect (host stack) or
>> rebind subsystem to a port (target stack).
>>
>> I'm thinking that maybe the right behaviour should be to try and restore
>> everything as soon as the device becomes available again. I don't think a
>> device reset should look different to the users than ports going down and up
>> again.
>
> Hmm, not sure I fully agree here. In my mind device removal means the
> device is going away which means there is no point in keeping the
> controller around...

The same could have been said of a port going down. You don't know if it
will come back up connected to the same network...

> AFAIK device resets usually are expected to quiesce inflight I/O,
> cleanup resources and restore when the reset sequence completes (which is
> what we do in nvme controller resets). I'm not sure I understand why
> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
> rdma_cm or .remove_one via ib_client API). I think the correct interface
> would be suspend/resume semantics for RDMA device resets (similar to pm
> interface).
>
> I think that it would make a much cleaner semantics and ULPs should be
> able to understand exactly what to do (which is what you suggested
> above).
>
> CCing linux-rdma.

Maybe so. I don't know what's the "standard" here for Linux in general
and networking devices in particular. Let's see if linux-rdma agrees
here.

> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
> reconnects because we have dma mappings we need to unmap for each
> request in the tagset which we don't teardown in every reconnect (as
> we may have inflight I/O). We could have theoretically use reinit_tagset
> to do that though.

Obviously it isn't that simple... Just trying to agree on the right
direction to go.

>> In the reconnect flow the stack already repeats creating the cm_id and
>> resolving address and route, so when the RDMA device comes back up, and
>> assuming it will be configured with the same address and connected to the same
>> network (as is the case in device reset), connections will be restored
>> automatically.
>
> As I said, I think that the problem is the interface of RDMA device
> resets. IMO, device removal means we need to delete all the nvme
> controllers associated with the device.

Do you think all associated controllers should be deleted when a TCP
socket gets disconnected in NVMe-over-TCP? Do they?

> If we were to handle hotplug events where devices come into the system,
> the correct way would be to send a udev event to userspace and not keep
> stale controllers around with hope they will come back. userspace is a
> much better place to keep a state with respect to these scenarios IMO.

That's the important part: I'm trying to understand the direction we
should go. First, let's agree that the user (admin) expects a simple
behaviour: if a configuration was made to connect with a remote storage,
the stack (driver, daemons, scripts) should make an effort to keep those
connections whenever possible.

Yes, it could be a userspace script/daemon job. But I was under the
impression that this group tries to consolidate most (all?) of the
functionality into the driver, and not rely on userspace daemons. Maybe
a lesson learnt from iSCSI? If all are in agreement that this should be
done in userspace, that's fine.

>> At the target stack things are even worse. When the RDMA device resets or
>> disappears the softlink between the port and the subsystem stays "hanging". It
>> does not represent an active bind, and when the device will come back with the
>> same address and network it will not start working (even though the softlink
>> is there). This is quite confusing to the user.
>
> Right, I think we would need to reflect port state (active/inactive) via
> configfs and nvmetcli could reflect it in its UI.

You mean the softlink should disappear in this case?
It can't stay as it means nothing (the bond between the port and the
subsystem is gone forever the way it is now). But removing the softlink
in configfs sounds against the nature of things: the admin put it there,
and it reflects the admin's wish to expose a subsystem via a port. This
wish did not change... Are there examples of configfs items being
changed by the stack against the admin's wish?

>> What I suggest here is to implement something similar to the reconnect flow at
>> the host, and repeat the flow that is doing the rdma_bind_addr. This way,
>> again, when the device will come back with the same address and network the
>> bind will succeed and the subsystem will become functional again. In this case
>> it makes sense to keep the softlink during all this time, as the stack really
>> tries to re-bind to the port.
>
> I'm sorry but I don't think that is the correct approach. If the device
> is removed then we break the association and do nothing else. As for
> RDMA device resets, this goes back to the interface problem I pointed
> out.

Are we in agreement that the user (admin) expects the software stack to
keep this binding when possible (like keeping the connections in the
initiator case)? After all, the admin has specifically put the softlink
there - it expresses the admin's wish.

We could agree here too that it is the task of a userspace daemon/script.
But then we'll need to keep the entire configuration in another place
(meaning configfs alone is not enough anymore), constantly compare it to
the current configuration in configfs, and make the adjustments. And
we'll need the stack to remove the symlink, which I still think is odd
behaviour.

Oren
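The "keep the configuration in another place and constantly compare it"
burden described above is essentially a reconcile step. The core of such a
hypothetical userspace daemon could look like this toy function (all names
invented; a real daemon would read configfs and its own saved configuration,
and would be driven by events rather than polling):

```python
def reconcile(desired_links, configfs_links):
    """Return the port->subsystem softlinks the daemon needs to re-create.

    desired_links  - the admin's saved intent, kept outside configfs
    configfs_links - what is currently active under configfs

    Each link is a (port, subsystem_nqn) pair. Links that exist in the
    desired set but not in configfs (e.g. because a device reset dropped
    the bind) are the ones to re-bind."""
    return sorted(set(desired_links) - set(configfs_links))
```

This is exactly the duplication Oren objects to: the desired state now lives
in two places, and the daemon's whole job is computing their difference.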
* Reconnect on RDMA device reset
  2018-01-24  7:41 ` Oren Duer
@ 2018-01-24 20:52   ` Sagi Grimberg
  2018-01-25 14:10     ` Oren Duer
  2018-01-25 18:13     ` Doug Ledford
  0 siblings, 2 replies; 19+ messages in thread
From: Sagi Grimberg @ 2018-01-24 20:52 UTC (permalink / raw)

>>> Today host and target stacks will respond to RDMA device reset (or plug out
>>> and plug in) by cleaning all resources related to that device, and sitting
>>> idle waiting for administrator intervention to reconnect (host stack) or
>>> rebind subsystem to a port (target stack).
>>>
>>> I'm thinking that maybe the right behaviour should be to try and restore
>>> everything as soon as the device becomes available again. I don't think a
>>> device reset should look different to the users than ports going down and up
>>> again.
>>
>> Hmm, not sure I fully agree here. In my mind device removal means the
>> device is going away which means there is no point in keeping the
>> controller around...
>
> The same could have been said on a port going down. You don't know if it will
> come back up connected to the same network...

That's true. However in my mind port events are considered transient,
and we do give up at some point. I'm simply arguing that device removal
has different semantics. I don't argue that we need to support it.

>> AFAIK device resets usually are expected to quiesce inflight I/O,
>> cleanup resources and restore when the reset sequence completes (which is
>> what we do in nvme controller resets). I'm not sure I understand why
>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>> would be suspend/resume semantics for RDMA device resets (similar to pm
>> interface).
>>
>> I think that it would make a much cleaner semantics and ULPs should be
>> able to understand exactly what to do (which is what you suggested
>> above).
>>
>> CCing linux-rdma.
>
> Maybe so. I don't know what's the "standard" here for Linux in general and
> networking devices in particular. Let's see if linux-rdma agree here.

I would like to hear more opinions on the current interface.

>> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
>> reconnects because we have dma mappings we need to unmap for each
>> request in the tagset which we don't teardown in every reconnect (as
>> we may have inflight I/O). We could have theoretically use reinit_tagset
>> to do that though.
>
> Obviously it isn't that simple... Just trying to agree on the right direction
> to go.

Yea, I agree. It shouldn't be too hard either.

>>> In the reconnect flow the stack already repeats creating the cm_id and
>>> resolving address and route, so when the RDMA device comes back up, and
>>> assuming it will be configured with the same address and connected to the same
>>> network (as is the case in device reset), connections will be restored
>>> automatically.
>>
>> As I said, I think that the problem is the interface of RDMA device
>> resets. IMO, device removal means we need to delete all the nvme
>> controllers associated with the device.
>
> Do you think all associated controllers should be deleted when a TCP socket
> gets disconnected in NVMe-over-TCP? Do they?

Nope, but that is equivalent to a QP going into error state IMO, and we
don't do that in nvme-rdma as well. There is a slight difference, as
tcp controllers are not responsible for releasing any HW resource nor
standing in the way of the device resetting itself. In RDMA, the ULP
needs to cooperate with the stack, so I think it would be better if the
interface mapped better to a reset process (i.e. transient).

>> If we were to handle hotplug events where devices come into the system,
>> the correct way would be to send a udev event to userspace and not keep
>> stale controllers around with hope they will come back. userspace is a
>> much better place to keep a state with respect to these scenarios IMO.
>
> That's the important part I'm trying to understand the direction we should go.
> First, let's agree that the user (admin) expects a simple behaviour:

No arguments here..

> if a configuration was made to connect with a remote storage, the stack (driver,
> daemons, scripts) should make an effort to keep those connections whenever
> possible.

True, and in fact Johannes suggested a related topic for LSF:
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015159.html

For now, we don't have a good way to auto-connect (or auto-reconnect)
for IP based nvme transports.

> Yes, it could be a userspace script/daemon job. But I was under the impression
> that this group tries to consolidate most (all?) of the functionality into the
> driver, and not rely on userspace daemons. Maybe a lesson learnt from iSCSI?

Indeed that is a guideline that was taken early on. But
auto-connect/auto-discovery is not something I think we'd like to
implement in the kernel...

> You mean the softlink should disappear in this case?
> It can't stay as it means nothing (the bond between the port and the subsystem
> is gone forever the way it is now).

I meant that we expose a port state via configfs. As for device
hotplug, maybe the individual transports can propagate a udev event to
userspace to try to re-enable the port or something... Don't have it
all figured out..

>>> What I suggest here is to implement something similar to the reconnect flow at
>>> the host, and repeat the flow that is doing the rdma_bind_addr. This way,
>>> again, when the device will come back with the same address and network the
>>> bind will succeed and the subsystem will become functional again. In this case
>>> it makes sense to keep the softlink during all this time, as the stack really
>>> tries to re-bind to the port.
>>
>> I'm sorry but I don't think that is the correct approach. If the device
>> is removed then we break the association and do nothing else. As for
>> RDMA device resets, this goes back to the interface problem I pointed
>> out.
>
> Are we in agreement that the user (admin) expects the software stack to keep
> this bound when possible (like keeping the connections in the initiator case)?
> After all, the admin has specifically put the softlink there - it expresses
> the admin's wish.

It will be the case if the port binds to INADDR_ANY :)

Anyways, I think we agree here (at least partially). I think that we
need to reflect port state in configfs (nvmetcli can color it red), and
when a device completes its reset sequence we get an event that tells
us just that; we send it to userspace and re-enable the port...

> We could agree here too that it is the task of a userspace daemon/script. But
> then we'll need to keep the entire configuration in another place (meaning
> configfs alone is not enough anymore),

We have nvmetcli for that. We just need a reactor to udev.

> constantly compare it to the current configuration in configfs, and make the
> adjustments.

I would say that we should have it driven by changes from the kernel...

> And we'll need the stack to remove the symlink, which I still think is an odd
> behaviour.

No need to remove the symlink.
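The "port events are transient, and we do give up at some point" behaviour is
a bounded retry loop. A toy model of its schedule, with parameter names
borrowed from nvme's reconnect_delay / ctrl_loss_tmo connect options (the
analogy is an assumption of this sketch, not a quote of the driver):

```python
def reconnect_schedule(reconnect_delay, ctrl_loss_tmo):
    """Yield the retry instants (in seconds) of a bounded reconnect loop:
    retry every reconnect_delay seconds, and give up once ctrl_loss_tmo
    seconds have elapsed. A negative ctrl_loss_tmo means never give up,
    which yields an infinite schedule."""
    t = reconnect_delay
    while ctrl_loss_tmo < 0 or t <= ctrl_loss_tmo:
        yield t
        t += reconnect_delay
```

A device reset folded into this flow would just be one more transient error
consuming part of the loss-timeout budget; permanent removal is the case that
falls outside it.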
* Reconnect on RDMA device reset
  2018-01-24 20:52 ` Sagi Grimberg
@ 2018-01-25 14:10   ` Oren Duer
  2018-01-29 19:58     ` Sagi Grimberg
  2018-01-25 18:13   ` Doug Ledford
  1 sibling, 1 reply; 19+ messages in thread
From: Oren Duer @ 2018-01-25 14:10 UTC (permalink / raw)

>> Yes, it could be a userspace script/daemon job. But I was under the impression
>> that this group tries to consolidate most (all?) of the functionality into the
>> driver, and not rely on userspace daemons. Maybe a lesson learnt from iSCSI?
>
> Indeed that is a guideline that was taken early on. But
> auto-connect/auto-discovery is not something I think we'd like to
> implement in the kernel...

So to summarize your view on this topic:
* Agreed that the user experience should be such that after an RDMA
  device reset everything should continue as before without admin
  intervention.
* Both for host and target stacks.
* This should not be handled by the driver itself.
* A userspace daemon that listens on udev events should retry the
  reconnects or re-binds using configfs.
* Reconnect on RDMA device reset
  2018-01-25 14:10 ` Oren Duer
@ 2018-01-29 19:58   ` Sagi Grimberg
  0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2018-01-29 19:58 UTC (permalink / raw)

> So to summarize your view on this topic:
> * Agreed that the user experience should be such that after an RDMA
>   device reset everything should continue as before without admin
>   intervention.

Yes

> * Both for host and target stacks.

Yes

> * This should not be handled by the driver itself.
> * A userspace daemon that listens on udev events should retry the
>   reconnects or re-binds using configfs.

No, I'm saying transient errors should be handled by the driver itself;
recovery from permanent errors (like unplug+replug) should be handled
outside the kernel.
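The split of responsibilities in that last answer can be stated as a tiny
policy table: transient errors are retried in the driver, permanent removal
is surfaced to userspace (e.g. via a udev event). This is a sketch with
invented event names, not kernel code; in particular, "device_reset" is
classified as transient only under the suspend/resume-style interface argued
for earlier in the thread.

```python
def recovery_policy(event):
    """Map a fabric/device event to who recovers from it.

    Transient errors (bounded retries in the driver) vs. permanent
    events (notify userspace, which may re-connect or re-bind via
    configfs). Unknown events are rejected rather than guessed at."""
    transient = {"port_down", "qp_error", "device_reset"}
    permanent = {"device_removal"}
    if event in transient:
        return "kernel_reconnect"
    if event in permanent:
        return "udev_notify_userspace"
    raise ValueError(f"unknown event: {event}")
```

The unresolved question in the thread is precisely which bucket an RDMA
device reset lands in, since today it reaches ULPs dressed up as a removal.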
* Reconnect on RDMA device reset 2018-01-24 20:52 ` Sagi Grimberg 2018-01-25 14:10 ` Oren Duer @ 2018-01-25 18:13 ` Doug Ledford 2018-01-25 19:06 ` Chuck Lever ` (2 more replies) 1 sibling, 3 replies; 19+ messages in thread From: Doug Ledford @ 2018-01-25 18:13 UTC (permalink / raw) On Wed, 2018-01-24@22:52 +0200, Sagi Grimberg wrote: > > > > Today host and target stacks will respond to RDMA device reset (or plug > > > > out > > > > and plug in) by cleaning all resources related to that device, and sitting > > > > idle waiting for administrator intervention to reconnect (host stack) or > > > > rebind subsystem to a port (target stack). > > > > > > > > I'm thinking that maybe the right behaviour should be to try and restore > > > > everything as soon as the device becomes available again. I don't think a > > > > device reset should look different to the users than ports going down and > > > > up > > > > again. > > > > > > > > > Hmm, not sure I fully agree here. In my mind device removal means the > > > device is going away which means there is no point in keeping the controller > > > around... > > > > The same could have been said on a port going down. You don't know if it will > > come back up connected to the same network... > > That's true. However in my mind port events are considered transient, > and we do give up at some point. I'm simply arguing that device removal > has different semantics. I don't argue that we need to support it. I think it depends on how you view yourself (meaning the target or initiator stacks). It's my understanding that if device eth0 disappeared completely, and then device eth1 was plugged in, and eth1 got the same ip address as eth0, then as long as any TCP sockets hadn't gone into reset state, the iSCSI devices across the existing connection would simply keep working. This is correct, yes? If so, then maybe you want iSER at least to operate the same way. 
The problem, of course, is that iSER may use the IP address and ports for connection, but then it transitions to queue pairs for data transfer. Because iSER does that, it is sitting at the same level as, say, the net core that *did* know about the eth change in the above example and transitioned the TCP socket from the old device to the new, meaning that iSER now has to take that same responsibility on itself if it wishes the user visible behavior of iSER devices to be the same as iSCSI devices. And that would even be true if the old RDMA device went away and a new RDMA device came up with the old IP address, so the less drastic form of bouncing the existing device should certainly fall under the same umbrella. I *think* for SRP this is already the case. The SRP target uses the kernel LIO framework, so if you bounce the device under the SRPt layer, doesn't the config get preserved? So that when the device came back up, the LIO configuration would still be there and the SRPt driver would see that? Bart? For the SRP client, I'm almost certain it will try to reconnect since it uses a user space daemon with a shell script that restarts the daemon on various events. That might have changed...didn't we just take a patch to rdma-core to drop the shell script? It might not reconnect automatically with the latest rdma-core, I'd have to check. Bart should know though... I haven't the faintest clue on NVMe over fabrics though. But, again, I think that's up to you guys to decide what semantics you want. With iSER it's a little easier since you can use the TCP semantics as a guideline and you have an IP/port discovery so it doesn't even have to be the same controller that comes back. With SRP it must be the same controller that comes back or else your login information will be all wrong (well, we did just take RDMA_CM support patches for SRP that will allow IP/port addressing instead, so theoretically it could now do the same thing if you are using RDMA_CM mode logins). 
I don't know the details of the NVMe addressing though. > > > AFAIK device resets usually are expected to quiesce inflight I/O, > > > cleanup resources and restore when the reset sequence completes (which is > > > what we do in nvme controller resets). I think your perspective here might be a bit skewed by the way the NVMe stack is implemented (which was intentional for speed as I understand it). As a differing example, in the SCSI stack when the LLD does a SCSI host reset, it resets the host but does not restore or restart any commands that were aborted. It is up to the upper layer SCSI drivers to do so (if they chose, they might send it back to the block layer). From the way you wrote the above, it sounds like the NVMe layer is almost monolithic in nature with no separation between upper level consumer layer and lower level driver layer, and so you can reset/restart all internally. I would argue that's rare in the linux kernel and most places the low level driver resets, and some other upper layer has to restart things if it wants or error out if it doesn't. > > > I'm not sure I understand why > > > RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via > > > rdma_cm or .remove_one via ib_client API). I think the correct interface > > > would be suspend/resume semantics for RDMA device resets (similar to pm > > > interface). No, we can't do this. Suspend/Resume is not the right model for an RDMA device reset. An RDMA device reset is a hard action that stops all ongoing DMA regardless of its source. Those sources include kernel layer consumers, user space consumers acting without the kernel's direct intervention, and ongoing DMA with remote RDMA peers (which will throw the remote queue pairs into an error state almost immediately). In the future it very likely could include RDMA between things like GPU offload processors too. We can't restart that stuff even if we wanted to. So suspend/resume semantics for an RDMA device level reset is a non- starter. 
> > > I think that it would make a much cleaner semantics and ULPs should be
> > > able to understand exactly what to do (which is what you suggested above).
> > >
> > > CCing linux-rdma.
> >
> > Maybe so. I don't know what the "standard" is here for Linux in general and networking devices in particular. Let's see if linux-rdma agrees here.
>
> I would like to hear more opinions on the current interface.

There is a difference between an RDMA device and other network devices. The net stack is much more like the SCSI stack in that you have an upper layer connection (socket or otherwise), a lower layer transport, and the net core code, which is free to move your upper layer abstraction from one lower layer transport to another. With the RDMA subsystem, your upper layer is connecting directly into the low level hardware. If you want a semantic that includes reconnection on an event, then it has to be handled in your upper layer, as there is no intervening middle layer to abstract out the task of moving your connection from one low level device to another. (That's not to say we couldn't create one, and several actually already exist, like SMC-R and RDS, but direct hooks into the core ib stack are not abstracted out and you are talking directly to the hardware.) And if you want to support moving your connection from an old removed device to a new replacement device that is not simply the same physical device being plugged back in, then you need an addressing scheme that doesn't rely on the link layer hardware address of the device.

> > > Regardless of ib_client vs. rdma_cm, we can't simply perform normal
> > > reconnects because we have dma mappings we need to unmap for each
> > > request in the tagset which we don't tear down in every reconnect (as
> > > we may have inflight I/O). We could theoretically have used reinit_tagset
> > > to do that though.
> >
> > Obviously it isn't that simple... Just trying to agree on the right direction to go.
>
> Yea, I agree. It shouldn't be too hard, either.

> > > > In the reconnect flow the stack already repeats creating the cm_id and
> > > > resolving address and route, so when the RDMA device comes back up, and
> > > > assuming it will be configured with the same address and connected to the
> > > > same network (as is the case in device reset), connections will be restored
> > > > automatically.
> > >
> > > As I said, I think that the problem is the interface of RDMA device
> > > resets. IMO, device removal means we need to delete all the nvme
> > > controllers associated with the device.
> >
> > Do you think all associated controllers should be deleted when a TCP socket
> > gets disconnected in NVMe-over-TCP? Do they?
>
> Nope, but that is equivalent to a QP going into error state IMO, and we
> don't do that in nvme-rdma as well.

There is no equivalent in the TCP realm of an RDMA controller reset or an RDMA controller permanent removal event. When dealing with TCP, if the underlying ethernet device is reset, you *might* get a TCP socket reset, you might not. If the underlying ethernet is removed, you might get a socket reset, you might not, depending on how the route to the remote host is re-established. If all IP capable devices in the entire system are removed, your TCP socket will get a reset, and attempts to reconnect will get an error.

None of those sound semantically comparable to RDMA device unplug/replug. Again, that's just because the net core never percolates that up to the TCP layer.

When you have a driver that has both TCP and RDMA transports, the truth is you are plugging into two very different levels of the kernel, and the work you have to do to support one is very different from the other. I don't think it's worthwhile to even talk about trying to treat them equivalently unless you want to take on an addressing scheme and reset/restart capability on the RDMA side of things that you don't have to have on the TCP side.

As a user of things like iSER/SRP/NVMe, I would personally like connections to persist across non-fatal events. But the RDMA stack, as it stands, can't reconnect things for you; you would have to do that in your own code.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
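The "handled in your upper layer" approach Doug describes can be reduced to a small policy decision keyed off the connection-manager event. The sketch below uses local stand-in enums (mirroring, but not using, the kernel's RDMA_CM_EVENT_* values); it only illustrates the decision of routing DEVICE_REMOVAL into the existing reconnect flow rather than tearing the controller down, not the actual verbs plumbing:

```c
/* Hypothetical ULP-side event policy. The enum values are local
 * stand-ins for rdma_cm events; nothing here touches real verbs. */
enum ulp_event { EV_ADDR_ERROR, EV_DISCONNECTED, EV_DEVICE_REMOVAL };
enum ulp_action { ACT_FAIL_CONNECT, ACT_SCHEDULE_RECONNECT, ACT_DELETE_CTRL };

/* With auto_reconnect set, DEVICE_REMOVAL feeds the same reconnect
 * path as a transport disconnect; otherwise it deletes the controller,
 * which is today's behavior. */
static enum ulp_action action_for_event(enum ulp_event ev, int auto_reconnect)
{
    switch (ev) {
    case EV_DISCONNECTED:
        return ACT_SCHEDULE_RECONNECT;
    case EV_DEVICE_REMOVAL:
        return auto_reconnect ? ACT_SCHEDULE_RECONNECT : ACT_DELETE_CTRL;
    default:
        return ACT_FAIL_CONNECT;
    }
}
```

The point of the table is that the ULP, not the core, owns the choice: the same event can mean "clean up forever" or "re-enter the reconnect state machine" depending on policy.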
* Reconnect on RDMA device reset
From: Chuck Lever @ 2018-01-25 19:06 UTC

> On Jan 25, 2018, at 10:13 AM, Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2018-01-24 at 22:52 +0200, Sagi Grimberg wrote:
>>>>> Today host and target stacks will respond to RDMA device reset (or plug out
>>>>> and plug in) by cleaning all resources related to that device, and sitting
>>>>> idle waiting for administrator intervention to reconnect (host stack) or
>>>>> rebind subsystem to a port (target stack).
>>>>>
>>>>> I'm thinking that maybe the right behaviour should be to try and restore
>>>>> everything as soon as the device becomes available again. I don't think a
>>>>> device reset should look different to the users than ports going down and up
>>>>> again.
>>>>
>>>> Hmm, not sure I fully agree here. In my mind device removal means the
>>>> device is going away which means there is no point in keeping the controller
>>>> around...
>>>
>>> The same could have been said on a port going down. You don't know if it will
>>> come back up connected to the same network...
>>
>> That's true. However in my mind port events are considered transient,
>> and we do give up at some point. I'm simply arguing that device removal
>> has different semantics. I don't argue that we need to support it.
>
> I think it depends on how you view yourself (meaning the target or initiator stacks). It's my understanding that if device eth0 disappeared completely, and then device eth1 was plugged in, and eth1 got the same ip address as eth0, then as long as any TCP sockets hadn't gone into reset state, the iSCSI devices across the existing connection would simply keep working. This is correct, yes?

For NFS/RDMA, I think of the "failover" case where a device is removed, then a new one is plugged in (or an existing cold replacement is made available) with the same IP configuration.

On a "hard" NFS mount, we want the upper layers to wait for a new suitable device to be made available, and then to use it to resend any pending RPCs. The workload should continue after a new device is available.

Feel free to tell me I'm full of turtles.

> If so, then maybe you want iSER at least to operate the same way. The problem, of course, is that iSER may use the IP address and ports for connection, but then it transitions to queue pairs for data transfer. Because iSER does that, it is sitting at the same level as, say, the net core that *did* know about the eth change in the above example and transitioned the TCP socket from the old device to the new, meaning that iSER now has to take that same responsibility on itself if it wishes the user visible behavior of iSER devices to be the same as iSCSI devices. And that would even be true if the old RDMA device went away and a new RDMA device came up with the old IP address, so the less drastic form of bouncing the existing device should certainly fall under the same umbrella.
>
> I *think* for SRP this is already the case. The SRP target uses the kernel LIO framework, so if you bounce the device under the SRPt layer, doesn't the config get preserved? So that when the device came back up, the LIO configuration would still be there and the SRPt driver would see that? Bart?
>
> For the SRP client, I'm almost certain it will try to reconnect since it uses a user space daemon with a shell script that restarts the daemon on various events. That might have changed... didn't we just take a patch to rdma-core to drop the shell script? It might not reconnect automatically with the latest rdma-core, I'd have to check. Bart should know though...
>
> I haven't the faintest clue on NVMe over fabrics though. But, again, I think that's up to you guys to decide what semantics you want. With iSER it's a little easier since you can use the TCP semantics as a guideline and you have IP/port discovery, so it doesn't even have to be the same controller that comes back. With SRP it must be the same controller that comes back or else your login information will be all wrong (well, we did just take RDMA_CM support patches for SRP that will allow IP/port addressing instead, so theoretically it could now do the same thing if you are using RDMA_CM mode logins). I don't know the details of the NVMe addressing though.
>
>>>> AFAIK device resets usually are expected to quiesce inflight I/O,
>>>> cleanup resources and restore when the reset sequence completes (which is
>>>> what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the NVMe stack is implemented (which was intentional for speed as I understand it). As a differing example, in the SCSI stack when the LLD does a SCSI host reset, it resets the host but does not restore or restart any commands that were aborted. It is up to the upper layer SCSI drivers to do so (if they choose, they might send it back to the block layer). From the way you wrote the above, it sounds like the NVMe layer is almost monolithic in nature with no separation between an upper level consumer layer and a lower level driver layer, and so you can reset/restart all internally. I would argue that's rare in the Linux kernel; in most places the low level driver resets, and some other upper layer has to restart things if it wants, or error out if it doesn't.
>
>>>> I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>> would be suspend/resume semantics for RDMA device resets (similar to the pm
>>>> interface).
>
> No, we can't do this. Suspend/resume is not the right model for an RDMA device reset. An RDMA device reset is a hard action that stops all ongoing DMA regardless of its source. Those sources include kernel layer consumers, user space consumers acting without the kernel's direct intervention, and ongoing DMA with remote RDMA peers (which will throw the remote queue pairs into an error state almost immediately). In the future it very likely could include RDMA between things like GPU offload processors too. We can't restart that stuff even if we wanted to. So suspend/resume semantics for an RDMA device level reset is a non-starter.
>
> [...]
>
> --
> Doug Ledford <dledford@redhat.com>

--
Chuck Lever
* Reconnect on RDMA device reset
From: Sagi Grimberg @ 2018-01-29 20:01 UTC

Hi Chuck,

> For NFS/RDMA, I think of the "failover" case where a device is
> removed, then a new one is plugged in (or an existing cold
> replacement is made available) with the same IP configuration.
>
> On a "hard" NFS mount, we want the upper layers to wait for
> a new suitable device to be made available, and then to use
> it to resend any pending RPCs. The workload should continue
> after a new device is available.

Really? So the context is held forever (in case the device never comes back)?
* Reconnect on RDMA device reset
From: Chuck Lever @ 2018-01-29 20:11 UTC

> On Jan 29, 2018, at 3:01 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hi Chuck,
>
>> For NFS/RDMA, I think of the "failover" case where a device is
>> removed, then a new one is plugged in (or an existing cold
>> replacement is made available) with the same IP configuration.
>> On a "hard" NFS mount, we want the upper layers to wait for
>> a new suitable device to be made available, and then to use
>> it to resend any pending RPCs. The workload should continue
>> after a new device is available.
>
> Really? So the context is held forever (in case the device never
> comes back)?

I didn't say this was the best approach :-) And it certainly can change if we have something better.

But yes, with a hard mount, the NFS and RPC client stack keeps the pending RPCs around and continues to attempt reconnection with the NFS server. The idea is that after an unplug, another device with the proper IP configuration can be made available, and then rdma_resolve_addr() can figure out how to reconnect. The associated NFS workload will be suspended until it can reconnect.

Now on the NFS server (target), an unplug results in connection abort. Any context at the transport layer is gone, though the NFS server maintains duplicate reply caches that can hold RPC replies for some time. Those are all bounded in size. The clients continue to attempt to reconnect until there is another device available that can allow the server to accept connections.

--
Chuck Lever
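The contrast in this exchange, "hard-mount" semantics that hold context and retry forever versus a policy that gives up after a bounded number of attempts, comes down to one predicate. A self-contained sketch with hypothetical names (no NFS/RPC or nvme internals involved):

```c
#include <stdbool.h>

/* Sketch of the two retry policies discussed: hard-mount style that
 * never gives up, vs. a bounded policy that abandons the controller
 * after max_attempts. Names are illustrative, not kernel API. */
struct reconnect_policy {
    bool hard;          /* hard-mount semantics: retry forever */
    int  max_attempts;  /* consulted only when !hard */
};

static bool should_retry(const struct reconnect_policy *p, int attempts_so_far)
{
    if (p->hard)
        return true;    /* context is held until a device returns */
    return attempts_so_far < p->max_attempts;
}
```

Today's nvme-rdma reconnect flow is effectively the bounded branch; Sagi's "held forever?" question is asking whether the hard branch is ever the right default.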
* Reconnect on RDMA device reset
From: Doug Ledford @ 2018-01-29 21:27 UTC

On Mon, 2018-01-29 at 15:11 -0500, Chuck Lever wrote:
>> On Jan 29, 2018, at 3:01 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>> Hi Chuck,
>>
>>> For NFS/RDMA, I think of the "failover" case where a device is
>>> removed, then a new one is plugged in (or an existing cold
>>> replacement is made available) with the same IP configuration.
>>> On a "hard" NFS mount, we want the upper layers to wait for
>>> a new suitable device to be made available, and then to use
>>> it to resend any pending RPCs. The workload should continue
>>> after a new device is available.
>>
>> Really? so the context is held forever (in case the device never
>> comes back)?
>
> I didn't say this was the best approach :-) And it certainly can
> change if we have something better.

Whether it's the best or not, it's the defined behavior of the "hard" mount option. So if someone doesn't want that, you don't use a hard mount ;-)

Hard mounts are great for situations where you have a high degree of faith that even if the server disappears, it will reappear soon. They suck when the server totally dies, though, because now all the hard mount clients are stuck :-/.

--
Doug Ledford <dledford@redhat.com>
* Reconnect on RDMA device reset
From: Chuck Lever @ 2018-01-29 21:46 UTC

> On Jan 29, 2018, at 4:27 PM, Doug Ledford <dledford@redhat.com> wrote:
>
> On Mon, 2018-01-29 at 15:11 -0500, Chuck Lever wrote:
>>> On Jan 29, 2018, at 3:01 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>>
>>> Hi Chuck,
>>>
>>>> For NFS/RDMA, I think of the "failover" case where a device is
>>>> removed, then a new one is plugged in (or an existing cold
>>>> replacement is made available) with the same IP configuration.
>>>> On a "hard" NFS mount, we want the upper layers to wait for
>>>> a new suitable device to be made available, and then to use
>>>> it to resend any pending RPCs. The workload should continue
>>>> after a new device is available.
>>>
>>> Really? so the context is held forever (in case the device never
>>> comes back)?
>>
>> I didn't say this was the best approach :-) And it certainly can
>> change if we have something better.
>
> Whether it's the best or not, it's the defined behavior of the "hard"
> mount option. So if someone doesn't want that, you don't use a hard
> mount ;-)
>
> Hard mounts are great for situations where you have a high degree of
> faith that even if the server disappears, it will reappear soon. They
> suck when the server totally dies, though, because now all the hard mount
> clients are stuck :-/.

We're working on fixing that.

--
Chuck Lever
* Reconnect on RDMA device reset
From: Jason Gunthorpe @ 2018-01-25 22:48 UTC

On Thu, Jan 25, 2018 at 01:13:42PM -0500, Doug Ledford wrote:
> various events. That might have changed... didn't we just take a patch
> to rdma-core to drop the shell script? It might not reconnect
> automatically with the latest rdma-core, I'd have to check. Bart should
> know though...

We dropped the shell script in favor of udev. srp_daemon will now stop when rdma devices are removed and restart again when they are added.

Jason
* Reconnect on RDMA device reset
From: Sagi Grimberg @ 2018-01-29 20:36 UTC

> I *think* for SRP this is already the case. The SRP target uses the
> kernel LIO framework, so if you bounce the device under the SRPt layer,
> doesn't the config get preserved? So that when the device came back up,
> the LIO configuration would still be there and the SRPt driver would see
> that? Bart?

I think you're right. I think we can do that if we keep the listener cm_id device node_guid, and when a new device comes in we can see if we have a cm listener on that device and re-listen. That is a good idea, Doug.

> For the SRP client, I'm almost certain it will try to reconnect since it
> uses a user space daemon with a shell script that restarts the daemon on
> various events. That might have changed... didn't we just take a patch
> to rdma-core to drop the shell script? It might not reconnect
> automatically with the latest rdma-core, I'd have to check. Bart should
> know though...

The srp driver relies on srp_daemon to discover and connect again over the new device. iSER relies on iscsiadm to reconnect. I guess that would be the correct approach for nvme as well (which we don't have at the moment)...

>>>> AFAIK device resets usually are expected to quiesce inflight I/O,
>>>> cleanup resources and restore when the reset sequence completes (which is
>>>> what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the NVMe stack is implemented (which was intentional for speed as I understand it). As a differing example, in the SCSI stack when the LLD does a SCSI host reset, it resets the host but does not restore or restart any commands that were aborted. It is up to the upper layer SCSI drivers to do so (if they choose, they might send it back to the block layer). From the way you wrote the above, it sounds like the NVMe layer is almost monolithic in nature with no separation between an upper level consumer layer and a lower level driver layer, and so you can reset/restart all internally. I would argue that's rare in the Linux kernel; in most places the low level driver resets, and some other upper layer has to restart things if it wants, or error out if it doesn't.

That is the case for nvme as well, but I was merely saying that a device reset is not really a device removal. And this makes it hard for the ULP to understand what to do (or for me at least...)

>>>> I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>> would be suspend/resume semantics for RDMA device resets (similar to the pm
>>>> interface).
>
> No, we can't do this. Suspend/resume is not the right model for an RDMA
> device reset. An RDMA device reset is a hard action that stops all
> ongoing DMA regardless of its source.

Suspend also requires that.

> Those sources include kernel
> layer consumers, user space consumers acting without the kernel's direct
> intervention, and ongoing DMA with remote RDMA peers (which will throw
> the remote queue pairs into an error state almost immediately). In the
> future it very likely could include RDMA between things like GPU offload
> processors too. We can't restart that stuff even if we wanted to. So
> suspend/resume semantics for an RDMA device level reset is a non-starter.

I see. I can understand the argument "we are stuck with what we have" for user-space, but does that mandate that we must live with that for kernel consumers as well? Even if the semantics are confusing? (Just asking, it's only my opinion :))

>>>> I think that it would make a much cleaner semantics and ULPs should be
>>>> able to understand exactly what to do (which is what you suggested
>>>> above).
>>>>
>>>> CCing linux-rdma.
>>>
>>> Maybe so. I don't know what the "standard" is here for Linux in general and
>>> networking devices in particular. Let's see if linux-rdma agrees here.
>>
>> I would like to hear more opinions on the current interface.
>
> There is a difference between an RDMA device and other network devices. The net stack is much more like the SCSI stack in that you have an upper layer connection (socket or otherwise), a lower layer transport, and the net core code, which is free to move your upper layer abstraction from one lower layer transport to another. With the RDMA subsystem, your upper layer is connecting directly into the low level hardware. If you want a semantic that includes reconnection on an event, then it has to be handled in your upper layer, as there is no intervening middle layer to abstract out the task of moving your connection from one low level device to another (that's not to say we couldn't create one, and several actually already exist, like SMC-R and RDS, but direct hooks into the core ib stack are not abstracted out and you are talking directly to the hardware). And if you want to support moving your connection from an old removed device to a new replacement device that is not simply the same physical device being plugged back in, then you need an addressing scheme that doesn't rely on the link layer hardware address of the device.

Actually, I didn't suggest that at all. I fully agree that the ULP needs to cooperate with the core and the HW, as it is holding physical resources. All I suggested is that the core would reflect that the device is resetting, and not reflect that the device is going away and that afterwards a new device comes in that happens to be the same device...

> As a user of things like iSER/SRP/NVMe, I would personally like
> connections to persist across non-fatal events. But the RDMA stack, as
> it stands, can't reconnect things for you, you would have to do that in
> your own code.

Again, I fully agree. I didn't mean that the core would handle everything for the consumer of the device at all. I just think that the interface can improve such that the consumer's life (and code) would be easier.
* Reconnect on RDMA device reset
From: Bart Van Assche @ 2018-01-29 21:34 UTC

On Mon, 2018-01-29 at 22:36 +0200, Sagi Grimberg wrote:
>> I *think* for SRP this is already the case. The SRP target uses the
>> kernel LIO framework, so if you bounce the device under the SRPt layer,
>> doesn't the config get preserved? So that when the device came back up,
>> the LIO configuration would still be there and the SRPt driver would see
>> that? Bart?
>
> I think you're right. I think we can do that if we keep the listener
> cm_id device node_guid and when a new device comes in we can see if we
> have a cm listener on that device and re-listen. That is a good idea
> Doug.

Sorry that I hadn't noticed this e-mail thread earlier and had not yet replied. The SRPT config should get preserved as long as the device removal function (srpt_remove_one()) does not get called.

>> For the SRP client, I'm almost certain it will try to reconnect since it
>> uses a user space daemon with a shell script that restarts the daemon on
>> various events. That might have changed... didn't we just take a patch
>> to rdma-core to drop the shell script? It might not reconnect
>> automatically with the latest rdma-core, I'd have to check. Bart should
>> know though...
>
> The srp driver relies on srp_daemon to discover and connect again over the
> new device. iSER relies on iscsiadm to reconnect. I guess that would be
> the correct approach for nvme as well (which we don't have at the
> moment)...

There are two mechanisms for the SRP initiator to make it reconnect to an SRP target:

1. srp_daemon. Even with the latest rdma-core changes, srp_daemon should still discover SRP targets and reconnect to the target systems it is allowed to reconnect to by its configuration file.

2. The reconnection mechanism in the SCSI SRP transport layer. See also the documentation of reconnect_delay in https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-transport-srp

Bart.
* Reconnect on RDMA device reset
From: Doug Ledford @ 2018-01-29 22:28 UTC

On Mon, 2018-01-29 at 22:36 +0200, Sagi Grimberg wrote:
>
> That is the case for nvme as well, but I was merely saying that a device
> reset is not really a device removal. And this makes it hard for the ULP
> to understand what to do (or for me at least...)

OK, I get that the difference between the two is making it hard to understand what to do. But the truth of the issue is that whether you are doing a reset or a remove/add cycle, what *your* code needs to do doesn't change. For both cases, your code must A) drop everything on the floor like a hot potato and B) restart from scratch. The only thing that's confusing you is that it's more or less assumed on a reset that you would auto-restart, whereas it isn't so clear that you would want to do the same on a remove/add cycle. I think the answer to your question is: if the same device comes back that went away, then yes, auto-restart would seem appropriate. If you make that policy decision, then the *only* difference between device reset and device hot-replug is that you actually have to verify that the same device came back as went away.

As an optional item, you could start a timer when the device disappears, and if it takes more than, say, 10 minutes to reappear, you could cancel the auto-restart on the basis that someone probably physically unplugged and replugged the card and they might not want that. But really, aside from the fact that the hot plug flow needs you to check that the same device comes back, reset and hot plug have the exact same requirements/needs and can be serviced by a single code path.

>>>>> I'm not sure I understand why
>>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>>> would be suspend/resume semantics for RDMA device resets (similar to the pm
>>>>> interface).
>>
>> No, we can't do this. Suspend/resume is not the right model for an RDMA
>> device reset. An RDMA device reset is a hard action that stops all
>> ongoing DMA regardless of its source.
>
> Suspend also requires that.

But suspend has a semantic of "local to this machine" and usually at least attempts to stop gracefully. Because RDMA allows for things such as a remote machine doing an RDMA READ when we suspend, we can't even attempt the normal graceful shutdown and are left with only the nuclear reset option. In addition, if you reset a network card, the network card's registers don't disappear, and your PCI MMIO region doesn't go away. When you reset an RDMA adapter, all of the allocated memory regions for card communications that have been handed out to kernel space, user space, etc. *do* disappear. That isn't really like the suspend semantic. You don't have the option of cleanly stopping things and quiescing the system prior to suspend, because your basic communication channel is gone already. From this point of view, the hot remove semantic is very fitting. The entire card didn't get hot removed, but certainly all of those allocated communication channels very well did.

>> Those sources include kernel
>> layer consumers, user space consumers acting without the kernel's direct
>> intervention, and ongoing DMA with remote RDMA peers (which will throw
>> the remote queue pairs into an error state almost immediately). In the
>> future it very likely could include RDMA between things like GPU offload
>> processors too. We can't restart that stuff even if we wanted to. So
>> suspend/resume semantics for an RDMA device level reset is a non-starter.
>
> I see. I can understand the argument "we are stuck with what we have"
> for user-space, but does that mandate that we must live with that for
> kernel consumers as well? Even if the semantics are confusing? (Just
> asking, it's only my opinion :))

See above. It's not about user versus kernel space, it's that we really did hot-remove a bunch of resources, even if not the card itself.

--
Doug Ledford <dledford@redhat.com>
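The policy Doug sketches in words, remember what was removed, and auto-restart only if the same device returns within a grace window, fits in a few lines. Everything below is an illustrative stand-in (local struct, a caller-supplied clock), not the ib_client API itself:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: on removal, record the device GUID and the removal time; on a
 * subsequent device-add, auto-restart only if the GUID matches and the
 * device returned within a grace window. Stand-in types only; a real
 * implementation would record these in .remove_one / check in .add_one. */
struct removed_dev {
    uint64_t node_guid;   /* GUID recorded at removal time */
    long     removed_at;  /* seconds; time source is up to the caller */
};

#define AUTO_RESTART_WINDOW_SEC (10 * 60)  /* the "say, 10 minutes" above */

static bool should_auto_restart(const struct removed_dev *old,
                                uint64_t new_guid, long now)
{
    if (old->node_guid != new_guid)
        return false;  /* a different card was plugged in */
    return (now - old->removed_at) <= AUTO_RESTART_WINDOW_SEC;
}
```

With this check in place, reset and hot-replug really do collapse into a single code path: the GUID comparison is the only extra work the replug case needs.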
* Reconnect on RDMA device reset 2018-01-29 22:28 ` Doug Ledford @ 2018-01-30 15:03 ` Oren Duer 2018-01-30 17:24 ` Doug Ledford 0 siblings, 1 reply; 19+ messages in thread From: Oren Duer @ 2018-01-30 15:03 UTC (permalink / raw) On Tue, Jan 30, 2018@12:28 AM, Doug Ledford <dledford@redhat.com> wrote: > On Mon, 2018-01-29@22:36 +0200, Sagi Grimberg wrote: >> > >> That is the case for nvme as well, but I was merely saying that device >> reset is not really a device removal. And this makes it hard for the ULP >> to understand what to do (or for me at least...) > > OK, I get that the difference between the two is making it hard to > understand what to do. But, the truth of the issue is that whether you > are doing a reset or a remove/add cycle, what *your* code needs to do > doesn't change. For both cases, your code must A) drop everything on > the floor like a hot potato and B) restart from scratch. Fully agree here. You don't want different code flow for reset. You already have remove/add flows, which require to to act the right way as Doug described. A reset is exactly remove and add of the same device. > The only thing > that's confusing you is that it's more or less assumed on a reset that > you would auto-restart, where as it isn't so clear that you would want > to do the same on a remove/add cycle. I think the answer to your > question is: if the same device comes back that went away, then yes, > auto-restart would seem appropriate. If you make that policy decision, > then the *only* difference between device reset and device hot-replug is > that you actually have to verify that the same device came back as went > away. > > As an optional item, you could start a timer when the device disappears, > and if it takes more than, say, 10 minutes to reappear, you could cancel > the auto-restart on the basis that someone probably physically unplugged > and replugged the card and they might not want that. 
> But really, aside from the fact that the hot plug flow needs you to
> check the same device comes back, reset and hot plug have the exact
> same requirements/needs and can be serviced by a single code path.

Not sure why we need to keep track of whether it is the same device or
not. I fail to understand why we trust the system admin to create the
connections at the beginning, but should not trust him anymore if the
device was removed, a new device was added in its place, and it was
configured with the same network IP/subnet. To me it looks like the
admin wanted exactly that: for the connections to be restored over the
new device.

If the admin does not want the reconnections to happen, he has the
option to explicitly request a disconnect. The same applies if the
device was removed forever.

And to help us with all that, we have rdma_cm. As long as we repeat the
rdma_connect() on the initiator side and rdma_bind/resolve_addr()... on
the target side, we'll get exactly this behaviour.

If we want this done with userspace daemons, that's fine. We'll need one
for the initiator stack and one for the target stack. And we'll probably
need some missing udev events and a configfs entry?

-- 
Oren
* Reconnect on RDMA device reset
  From: Doug Ledford @ 2018-01-30 17:24 UTC (permalink / raw)

On Tue, 2018-01-30 at 17:03 +0200, Oren Duer wrote:
> Not sure why we need to keep track whether it is the same device or not.
> I fail to understand why we trust the system admin to create the connections
> at the beginning, but we should not trust him anymore if the device was
> removed, a new device was added in place of it, and it was configured with the
> same network IP/subnet. To me it looks like the admin wanted exactly that:
> for the connections to be restored over the new device.

In my original email, I pointed out that I didn't know how NVMe over
Fabrics was doing addressing.  If it were like SRP, then it wouldn't
work because the device GUID is part of the addressing scheme, so a new
device wouldn't automatically show as matching that address.  If,
however, you use IP like iSER, then yes, I agree fully you can just test
the ability to route over the new device using the IP address and
restore if it works.

-- 
Doug Ledford <dledford@redhat.com>
* Reconnect on RDMA device reset
  From: Steve Wise @ 2018-01-30 17:51 UTC (permalink / raw)

> In my original email, I pointed out that I didn't know how NVMe over
> Fabrics was doing addressing.  If it were like SRP, then it wouldn't
> work because the device GUID is part of the addressing scheme, so a new
> device wouldn't automatically show as matching that address.  If,
> however, you use IP like iSER, then yes, I agree fully you can just test
> the ability to route over the new device using the IP address and
> restore if it works.

NVMe-oF/RDMA uses the rdma_cm with IP addresses, IP ports, etc.
Thread overview: 19+ messages

2018-01-22 15:07 Reconnect on RDMA device reset  Oren Duer
2018-01-23 12:42 ` Sagi Grimberg
2018-01-24  7:41   ` Oren Duer
2018-01-24 20:52     ` Sagi Grimberg
2018-01-25 14:10       ` Oren Duer
2018-01-29 19:58         ` Sagi Grimberg
2018-01-25 18:13       ` Doug Ledford
2018-01-25 19:06         ` Chuck Lever
2018-01-29 20:01           ` Sagi Grimberg
2018-01-29 20:11             ` Chuck Lever
2018-01-29 21:27               ` Doug Ledford
2018-01-29 21:46                 ` Chuck Lever
2018-01-25 22:48       ` Jason Gunthorpe
2018-01-29 20:36         ` Sagi Grimberg
2018-01-29 21:34           ` Bart Van Assche
2018-01-29 22:28             ` Doug Ledford
2018-01-30 15:03               ` Oren Duer
2018-01-30 17:24                 ` Doug Ledford
2018-01-30 17:51                   ` Steve Wise