* nvme_rdma - leaves provider resources allocated @ 2016-08-23 16:58 Steve Wise 2016-08-24 9:31 ` Sagi Grimberg 0 siblings, 1 reply; 6+ messages in thread From: Steve Wise @ 2016-08-23 16:58 UTC (permalink / raw) Assume an nvme_rdma host has one attached controller in RECONNECTING state, and that controller has failed to reconnect at least once and thus is in the delay_schedule time before retrying the connection. At that moment, there are no cm_ids allocated for that controller because the admin queue and the io queues have been freed. So nvme_rdma cannot get a DEVICE_REMOVAL from the rdma_cm. This means if the underlying provider module is removed, it will be removed with resources still allocated by nvme_rdma. For iw_cxgb4, this causes a BUG_ON() in gen_pool_destroy() because MRs are still allocated for the controller. Thoughts on how to fix this? Thanks, Steve. ^ permalink raw reply [flat|nested] 6+ messages in thread
* nvme_rdma - leaves provider resources allocated 2016-08-23 16:58 nvme_rdma - leaves provider resources allocated Steve Wise @ 2016-08-24 9:31 ` Sagi Grimberg 2016-08-24 14:09 ` Steve Wise 0 siblings, 1 reply; 6+ messages in thread From: Sagi Grimberg @ 2016-08-24 9:31 UTC (permalink / raw) > Assume an nvme_rdma host has one attached controller in RECONNECTING state, and > that controller has failed to reconnect at least once and thus is in the > delay_schedule time before retrying the connection. At that moment, there are > no cm_ids allocated for that controller because the admin queue and the io > queues have been freed. So nvme_rdma cannot get a DEVICE_REMOVAL from the > rdma_cm. This means if the underlying provider module is removed, it will be > removed with resources still allocated by nvme_rdma. For iw_cxgb4, this causes > a BUG_ON() in gen_pool_destroy() because MRs are still allocated for the > controller. > > Thoughts on how to fix this? Hey Steve, I think it's time to go back to your client register proposal. I can't think of any way to get it right at the moment... Maybe if we can make it only do something meaningful in remove_one() to handle device removal we can get away with it... ^ permalink raw reply [flat|nested] 6+ messages in thread
* nvme_rdma - leaves provider resources allocated 2016-08-24 9:31 ` Sagi Grimberg @ 2016-08-24 14:09 ` Steve Wise 2016-08-25 21:52 ` Sagi Grimberg 0 siblings, 1 reply; 6+ messages in thread From: Steve Wise @ 2016-08-24 14:09 UTC (permalink / raw) > > > Assume an nvme_rdma host has one attached controller in RECONNECTING state, > and > > that controller has failed to reconnect at least once and thus is in the > > delay_schedule time before retrying the connection. At that moment, there are > > no cm_ids allocated for that controller because the admin queue and the io > > queues have been freed. So nvme_rdma cannot get a DEVICE_REMOVAL from > the > > rdma_cm. This means if the underlying provider module is removed, it will be > > removed with resources still allocated by nvme_rdma. For iw_cxgb4, this causes > > a BUG_ON() in gen_pool_destroy() because MRs are still allocated for the > > controller. > > > > Thoughts on how to fix this? > > Hey Steve, > > I think it's time to go back to your client register proposal. > > I can't think of any way to get it right at the moment... > > Maybe if we can make it only do something meaningful in remove_one() > to handle device removal we can get away with it... Hey Sagi, I'm finalizing a WIP series that provides a different approach. (we can certainly reconsider my ib_client patch too). But my WIP adds the concept of an "unplug" cm_id for each nvme_rdma_ctrl controller. When the controller is first created and the admin qp is connected to the target, the unplug_cm_id is created and address resolution is done on it to bind it to the same device that the admin QP is bound to. This unplug_cm_id remains across any/all kato recovery and thus will always be available for DEVICE_REMOVAL events. This simplifies the unplug handler because the cm_id isn't associated with any of the IO queues nor the admin queue. I also found another bug: if the reconnect worker times out waiting for rdma connection setup on an IO or admin QP, a QP is leaked. I'm looking into this as well. Do you have any thoughts on the controller reference around deletion issue I posted? http://lists.infradead.org/pipermail/linux-nvme/2016-August/005919.html Thanks! Steve. ^ permalink raw reply [flat|nested] 6+ messages in thread
* nvme_rdma - leaves provider resources allocated 2016-08-24 14:09 ` Steve Wise @ 2016-08-25 21:52 ` Sagi Grimberg 2016-08-25 22:03 ` Steve Wise 0 siblings, 1 reply; 6+ messages in thread From: Sagi Grimberg @ 2016-08-25 21:52 UTC (permalink / raw) > Hey Sagi, > > I'm finalizing a WIP series that provides a different approach. (we can > certainly reconsider my ib_client patch too). But my WIP adds the concept of an > "unplug" cm_id for each nvme_rdma_ctrl controller. When the controller is first > created and the admin qp is connected to the target, the unplug_cm_id is created > and address resolution is done on it to bind it to the same device that the > admin QP is bound to. This unplug_cm_id remains across any/all kato recovery > and thus will always be available for DEVICE_REMOVAL events. This simplifies > the unplug handler because the cm_id isn't associated with any of the IO queues > nor the admin queue. OK, let's wait for the patches... > I also found another bug: if the reconnect worker times out waiting for rdma > connection setup on an IO or admin QP, a QP is leaked. I'm looking into this > as well. Hmm, I think you're right. If we passed address resolution but failed route (or got a general CONNECT/UNREACHABLE errors) we won't free the queue... I think this should fix this: -- diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index ab545fb347a0..452727c5ea13 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1382,10 +1382,12 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, case RDMA_CM_EVENT_REJECTED: cm_error = nvme_rdma_conn_rejected(queue, ev); break; - case RDMA_CM_EVENT_ADDR_ERROR: case RDMA_CM_EVENT_ROUTE_ERROR: case RDMA_CM_EVENT_CONNECT_ERROR: case RDMA_CM_EVENT_UNREACHABLE: + nvme_rdma_destroy_queue_ib(queue); + case RDMA_CM_EVENT_ADDR_ERROR: + /* FALLTHRU */ dev_dbg(queue->ctrl->ctrl.device, "CM error event %d\n", ev->event); cm_error = -ECONNRESET; -- > > Do you have any thoughts on the controller reference around deletion issue I > posted? > > http://lists.infradead.org/pipermail/linux-nvme/2016-August/005919.html Looks fine, but you need to use kref_get_unless_zero. ^ permalink raw reply related [flat|nested] 6+ messages in thread
* nvme_rdma - leaves provider resources allocated 2016-08-25 21:52 ` Sagi Grimberg @ 2016-08-25 22:03 ` Steve Wise 2016-08-25 22:06 ` Sagi Grimberg 0 siblings, 1 reply; 6+ messages in thread From: Steve Wise @ 2016-08-25 22:03 UTC (permalink / raw) > > Hey Sagi, > > > > I'm finalizing a WIP series that provides a different approach. (we can > > certainly reconsider my ib_client patch too). But my WIP adds the concept of an > > "unplug" cm_id for each nvme_rdma_ctrl controller. When the controller is first > > created and the admin qp is connected to the target, the unplug_cm_id is created > > and address resolution is done on it to bind it to the same device that the > > admin QP is bound to. This unplug_cm_id remains across any/all kato recovery > > and thus will always be available for DEVICE_REMOVAL events. This simplifies > > the unplug handler because the cm_id isn't associated with any of the IO queues > > nor the admin queue. > > OK, let's wait for the patches... I'll send them tomorrow. They're still Work In Progress (WIP) though. There is at least one other bug I'm chasing. > > > I also found another bug: if the reconnect worker times out waiting for rdma > > connection setup on an IO or admin QP, a QP is leaked. I'm looking into this > > as well. > > Hmm, I think you're right. If we passed address resolution but failed > route (or got a general CONNECT/UNREACHABLE errors) we won't free the > queue... > > I think this should fix this: > -- > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c > index ab545fb347a0..452727c5ea13 100644 > --- a/drivers/nvme/host/rdma.c > +++ b/drivers/nvme/host/rdma.c > @@ -1382,10 +1382,12 @@ static int nvme_rdma_cm_handler(struct > rdma_cm_id *cm_id, > case RDMA_CM_EVENT_REJECTED: > cm_error = nvme_rdma_conn_rejected(queue, ev); > break; > - case RDMA_CM_EVENT_ADDR_ERROR: > case RDMA_CM_EVENT_ROUTE_ERROR: > case RDMA_CM_EVENT_CONNECT_ERROR: > case RDMA_CM_EVENT_UNREACHABLE: > + nvme_rdma_destroy_queue_ib(queue); > + case RDMA_CM_EVENT_ADDR_ERROR: > + /* FALLTHRU */ > dev_dbg(queue->ctrl->ctrl.device, > "CM error event %d\n", ev->event); > cm_error = -ECONNRESET; > -- This won't quite do it. The reconnect thread can timeout and give up before any event arrives, and that path also needs to destroy the queue resources if it was allocated. This is my current patch. I appreciate any comments. (but I'll resend this formally tomorrow with other changes)... --- nvme-rdma: destroy nvme queue rdma resources on connect failure From: Steve Wise <swise@opengridcomputing.com> After address resolution, the nvme_rdma_queue rdma resources are allocated. If rdma route resolution or the connect fails, or the controller reconnect times out and gives up, then the rdma resources need to be freed. Otherwise, rdma resources are leaked. Signed-off-by: Steve Wise <swise at opengridcomputing.com> --- drivers/nvme/host/rdma.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 6198eaa..ea271df 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -87,6 +87,7 @@ struct nvme_rdma_request { enum nvme_rdma_queue_flags { NVME_RDMA_Q_CONNECTED = (1 << 0), + NVME_RDMA_IB_QUEUE_ALLOCATED = (1 << 1), }; struct nvme_rdma_queue { @@ -488,6 +489,7 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue) struct nvme_rdma_device *dev = queue->device; struct ib_device *ibdev = dev->dev; + clear_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags); rdma_destroy_qp(queue->cm_id); ib_free_cq(queue->ib_cq); @@ -538,6 +540,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue, ret = -ENOMEM; goto out_destroy_qp; } + set_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags); return 0; @@ -595,6 +598,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl, return 0; out_destroy_cm_id: + if (test_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &ctrl->queues[0].flags)) + nvme_rdma_destroy_queue_ib(queue); rdma_destroy_id(queue->cm_id); return ret; } ^ permalink raw reply related [flat|nested] 6+ messages in thread
* nvme_rdma - leaves provider resources allocated 2016-08-25 22:03 ` Steve Wise @ 2016-08-25 22:06 ` Sagi Grimberg 0 siblings, 0 replies; 6+ messages in thread From: Sagi Grimberg @ 2016-08-25 22:06 UTC (permalink / raw) > This won't quite do it. The reconnect thread can timeout and give up before any > event arrives, and that path also needs to destroy the queue resources if it was > allocated. This is my current patch. I appreciate any comments. (but I'll > resend this formally tomorrow with other changes)... Makes sense... With 2 patches now adding queue flags I think it's time to ramp a proper state machine? Looks like we starting to need it for all the corner cases... ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-08-25 22:06 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-08-23 16:58 nvme_rdma - leaves provider resources allocated Steve Wise 2016-08-24 9:31 ` Sagi Grimberg 2016-08-24 14:09 ` Steve Wise 2016-08-25 21:52 ` Sagi Grimberg 2016-08-25 22:03 ` Steve Wise 2016-08-25 22:06 ` Sagi Grimberg
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.