nvme_rdma - leaves provider resources allocated

All of lore.kernel.org
 help / color / mirror / Atom feed

* nvme_rdma - leaves provider resources allocated
@ 2016-08-23 16:58 Steve Wise
  2016-08-24  9:31 ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Steve Wise @ 2016-08-23 16:58 UTC (permalink / raw)


Assume an nvme_rdma host has one attached controller in RECONNECTING state, and
that controller has failed to reconnect at least once and thus is in the
delay_schedule time before retrying the connection.  At that moment, there are
no cm_ids allocated for that controller because the admin queue and the io
queues have been freed.  So nvme_rdma cannot get a DEVICE_REMOVAL from the
rdma_cm.  This means if the underlying provider module is removed, it will be
removed with resources still allocated by nvme_rdma.  For iw_cxgb4, this causes
a BUG_ON() in gen_pool_destroy() because MRs are still allocated for the
controller.

Thoughts on how to fix this?

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* nvme_rdma - leaves provider resources allocated
  2016-08-23 16:58 nvme_rdma - leaves provider resources allocated Steve Wise
@ 2016-08-24  9:31 ` Sagi Grimberg
  2016-08-24 14:09   ` Steve Wise
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2016-08-24  9:31 UTC (permalink / raw)



> Assume an nvme_rdma host has one attached controller in RECONNECTING state, and
> that controller has failed to reconnect at least once and thus is in the
> delay_schedule time before retrying the connection.  At that moment, there are
> no cm_ids allocated for that controller because the admin queue and the io
> queues have been freed.  So nvme_rdma cannot get a DEVICE_REMOVAL from the
> rdma_cm.  This means if the underlying provider module is removed, it will be
> removed with resources still allocated by nvme_rdma.  For iw_cxgb4, this causes
> a BUG_ON() in gen_pool_destroy() because MRs are still allocated for the
> controller.
>
> Thoughts on how to fix this?

Hey Steve,

I think it's time to go back to your client register proposal.

I can't think of any way to get it right at the moment...

Maybe if we can make it only do something meaningful in remove_one()
to handle device removal we can get away with it...

^ permalink raw reply	[flat|nested] 6+ messages in thread

* nvme_rdma - leaves provider resources allocated
  2016-08-24  9:31 ` Sagi Grimberg
@ 2016-08-24 14:09   ` Steve Wise
  2016-08-25 21:52     ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Steve Wise @ 2016-08-24 14:09 UTC (permalink / raw)


> 
> > Assume an nvme_rdma host has one attached controller in RECONNECTING state,
> and
> > that controller has failed to reconnect at least once and thus is in the
> > delay_schedule time before retrying the connection.  At that moment, there
are
> > no cm_ids allocated for that controller because the admin queue and the io
> > queues have been freed.  So nvme_rdma cannot get a DEVICE_REMOVAL from
> the
> > rdma_cm.  This means if the underlying provider module is removed, it will
be
> > removed with resources still allocated by nvme_rdma.  For iw_cxgb4, this
causes
> > a BUG_ON() in gen_pool_destroy() because MRs are still allocated for the
> > controller.
> >
> > Thoughts on how to fix this?
> 
> Hey Steve,
> 
> I think it's time to go back to your client register proposal.
> 
> I can't think of any way to get it right at the moment...
> 
> Maybe if we can make it only do something meaningful in remove_one()
> to handle device removal we can get away with it...

Hey Sagi, 

I'm finalizing a WIP series that provides a different approach.  (we can
certainly reconsider my ib_client patch too).  But my WIP adds the concept of an
"unplug" cm_id for each nvme_rdma_ctrl controller.  When the controller is first
created and the admin qp is connected to the target, the unplug_cm_id is created
and address resolution is done on it to bind it to the same device that the
admin QP is bound to.   This unplug_cm_id remains across any/all kato recovery
and thus will always be available for DEVICE_REMOVAL events.  This simplifies
the unplug handler because the cm_id isn't associated with any of the IO queues
nor the admin queue.  

I also found another bug:  if the reconnect worker times out waiting for rdma
connection setup on an IO or admin QP, a QP is leaked.   I'm looking into this
as well.

Do you have any thoughts on the controller reference around deletion issue I
posted?  

http://lists.infradead.org/pipermail/linux-nvme/2016-August/005919.html

Thanks!

Steve.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* nvme_rdma - leaves provider resources allocated
  2016-08-24 14:09   ` Steve Wise
@ 2016-08-25 21:52     ` Sagi Grimberg
  2016-08-25 22:03       ` Steve Wise
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2016-08-25 21:52 UTC (permalink / raw)



> Hey Sagi,
>
> I'm finalizing a WIP series that provides a different approach.  (we can
> certainly reconsider my ib_client patch too).  But my WIP adds the concept of an
> "unplug" cm_id for each nvme_rdma_ctrl controller.  When the controller is first
> created and the admin qp is connected to the target, the unplug_cm_id is created
> and address resolution is done on it to bind it to the same device that the
> admin QP is bound to.   This unplug_cm_id remains across any/all kato recovery
> and thus will always be available for DEVICE_REMOVAL events.  This simplifies
> the unplug handler because the cm_id isn't associated with any of the IO queues
> nor the admin queue.

OK, let's wait for the patches...

> I also found another bug:  if the reconnect worker times out waiting for rdma
> connection setup on an IO or admin QP, a QP is leaked.   I'm looking into this
> as well.

Hmm, I think you're right. If we passed address resolution but failed
route (or got a general CONNECT/UNREACHABLE errors) we won't free the
queue...

I think this should fix this:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ab545fb347a0..452727c5ea13 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1382,10 +1382,12 @@ static int nvme_rdma_cm_handler(struct 
rdma_cm_id *cm_id,
         case RDMA_CM_EVENT_REJECTED:
                 cm_error = nvme_rdma_conn_rejected(queue, ev);
                 break;
-       case RDMA_CM_EVENT_ADDR_ERROR:
         case RDMA_CM_EVENT_ROUTE_ERROR:
         case RDMA_CM_EVENT_CONNECT_ERROR:
         case RDMA_CM_EVENT_UNREACHABLE:
+               nvme_rdma_destroy_queue_ib(queue);
+       case RDMA_CM_EVENT_ADDR_ERROR:
+               /* FALLTHRU */
                 dev_dbg(queue->ctrl->ctrl.device,
                         "CM error event %d\n", ev->event);
                 cm_error = -ECONNRESET;
--

>
> Do you have any thoughts on the controller reference around deletion issue I
> posted?
>
> http://lists.infradead.org/pipermail/linux-nvme/2016-August/005919.html

Looks fine, but you need to use kref_get_unless_zero.

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* nvme_rdma - leaves provider resources allocated
  2016-08-25 21:52     ` Sagi Grimberg
@ 2016-08-25 22:03       ` Steve Wise
  2016-08-25 22:06         ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Steve Wise @ 2016-08-25 22:03 UTC (permalink / raw)



> > Hey Sagi,
> >
> > I'm finalizing a WIP series that provides a different approach.  (we can
> > certainly reconsider my ib_client patch too).  But my WIP adds the concept
of an
> > "unplug" cm_id for each nvme_rdma_ctrl controller.  When the controller is
first
> > created and the admin qp is connected to the target, the unplug_cm_id is
created
> > and address resolution is done on it to bind it to the same device that the
> > admin QP is bound to.   This unplug_cm_id remains across any/all kato
recovery
> > and thus will always be available for DEVICE_REMOVAL events.  This
simplifies
> > the unplug handler because the cm_id isn't associated with any of the IO
queues
> > nor the admin queue.
> 
> OK, let's wait for the patches...

I'll send them tomorrow.  They're still Work In Progress (WIP) though.  There is
at least one other bug I'm chasing.

> 
> > I also found another bug:  if the reconnect worker times out waiting for
rdma
> > connection setup on an IO or admin QP, a QP is leaked.   I'm looking into
this
> > as well.
> 
> Hmm, I think you're right. If we passed address resolution but failed
> route (or got a general CONNECT/UNREACHABLE errors) we won't free the
> queue...
> 
> I think this should fix this:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index ab545fb347a0..452727c5ea13 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1382,10 +1382,12 @@ static int nvme_rdma_cm_handler(struct
> rdma_cm_id *cm_id,
>          case RDMA_CM_EVENT_REJECTED:
>                  cm_error = nvme_rdma_conn_rejected(queue, ev);
>                  break;
> -       case RDMA_CM_EVENT_ADDR_ERROR:
>          case RDMA_CM_EVENT_ROUTE_ERROR:
>          case RDMA_CM_EVENT_CONNECT_ERROR:
>          case RDMA_CM_EVENT_UNREACHABLE:
> +               nvme_rdma_destroy_queue_ib(queue);
> +       case RDMA_CM_EVENT_ADDR_ERROR:
> +               /* FALLTHRU */
>                  dev_dbg(queue->ctrl->ctrl.device,
>                          "CM error event %d\n", ev->event);
>                  cm_error = -ECONNRESET;
> --

This won't quite do it.  The reconnect thread can timeout and give up before any
event arrives, and that path also needs to destroy the queue resources if it was
allocated.  This is my current patch.  I appreciate any comments.  (but I'll
resend this formally tomorrow with other changes)...

---

nvme-rdma: destroy nvme queue rdma resources on connect failure

From: Steve Wise <swise@opengridcomputing.com>

After address resolution, the nvme_rdma_queue rdma resources are
allocated.  If rdma route resolution or the connect fails, or the
controller reconnect times out and gives up, then the rdma resources
need to be freed.  Otherwise, rdma resources are leaked.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/nvme/host/rdma.c |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 6198eaa..ea271df 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -87,6 +87,7 @@ struct nvme_rdma_request {

 enum nvme_rdma_queue_flags {
        NVME_RDMA_Q_CONNECTED = (1 << 0),
+       NVME_RDMA_IB_QUEUE_ALLOCATED = (1 << 1),
 };

 struct nvme_rdma_queue {
@@ -488,6 +489,7 @@ static void nvme_rdma_destroy_queue_ib(struct
nvme_rdma_queue *queue)
        struct nvme_rdma_device *dev = queue->device;
        struct ib_device *ibdev = dev->dev;

+       clear_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags);
        rdma_destroy_qp(queue->cm_id);
        ib_free_cq(queue->ib_cq);

@@ -538,6 +540,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue
*queue,
                ret = -ENOMEM;
                goto out_destroy_qp;
        }
+       set_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags);

        return 0;

@@ -595,6 +598,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
        return 0;

 out_destroy_cm_id:
+       if (test_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &ctrl->queues[0].flags))
+               nvme_rdma_destroy_queue_ib(queue);
        rdma_destroy_id(queue->cm_id);
        return ret;
 }

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* nvme_rdma - leaves provider resources allocated
  2016-08-25 22:03       ` Steve Wise
@ 2016-08-25 22:06         ` Sagi Grimberg
  0 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2016-08-25 22:06 UTC (permalink / raw)



> This won't quite do it.  The reconnect thread can timeout and give up before any
> event arrives, and that path also needs to destroy the queue resources if it was
> allocated.  This is my current patch.  I appreciate any comments.  (but I'll
> resend this formally tomorrow with other changes)...

Makes sense...

With 2 patches now adding queue flags I think it's time to ramp
a proper state machine? Looks like we starting to need it for
all the corner cases...

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-08-25 22:06 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-23 16:58 nvme_rdma - leaves provider resources allocated Steve Wise
2016-08-24  9:31 ` Sagi Grimberg
2016-08-24 14:09   ` Steve Wise
2016-08-25 21:52     ` Sagi Grimberg
2016-08-25 22:03       ` Steve Wise
2016-08-25 22:06         ` Sagi Grimberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.