From mboxrd@z Thu Jan  1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Tue, 1 Nov 2016 11:49:03 -0500
Subject: nvmet_rdma crash - DISCONNECT event with NULL queue
In-Reply-To: <dbe5f18d-7928-f065-920f-753b30fb99a2@grimberg.me>
References: <01b401d23458$af277210$0d765630$@opengridcomputing.com>
 <6f42d056-284d-00fc-2b98-189f54957980@grimberg.me>
 <01cc01d2345b$d445acd0$7cd10670$@opengridcomputing.com>
 <4cc25277-429a-4ab9-470c-b3af1428ce93@grimberg.me>
 <01d101d2345e$2f054390$8d0fcab0$@opengridcomputing.com>
 <dbe5f18d-7928-f065-920f-753b30fb99a2@grimberg.me>
Message-ID: <01d901d2345f$da0d2e00$8e278a00$@opengridcomputing.com>

> >> So I think that the patch from Bart a few weeks ago was correct:
> >>
> >
> > Not quite.  It just guards against a null queue for TIMEWAIT_EXIT, which is
only
> > generated by the IB_CM.
> 
> Yes, this is why we need ADDR_CHANGE and DISCONNECTED too
> "(and include all the relevant cases around it)"
> 
> The other events we don't get to LIVE state and we don't have
> other error flows that will trigger queue teardown sequence.
> 
> --
> nvmet-rdma: Fix possible NULL deref when handling rdma cm
>   events
> 
> When we initiate queue teardown sequence we call rdma_destroy_qp
> which clears cm_id->qp, afterwards we call rdma_destroy_id, but
> we might see a rdma_cm event in between with a cleared cm_id->qp
> so watch out for that and silently ignore the event because this
> means that the queue teardown sequence is in progress.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche at sandisk.com>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>   drivers/nvme/target/rdma.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b4d648536c3e..240888efd920 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1351,7 +1351,13 @@ static int nvmet_rdma_cm_handler(struct
> rdma_cm_id *cm_id,
>          case RDMA_CM_EVENT_ADDR_CHANGE:
>          case RDMA_CM_EVENT_DISCONNECTED:
>          case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> -               nvmet_rdma_queue_disconnect(queue);
> +               /*
> +                * We might end up here when we already freed the qp
> +                * which means queue release sequence is in progress,
> +                * so don't get in the way...
> +                */
> +               if (!queue)
> +                       nvmet_rdma_queue_disconnect(queue);
>                  break;
>          case RDMA_CM_EVENT_DEVICE_REMOVAL:
>                  ret = nvmet_rdma_device_removal(cm_id, queue);
> --

This looks good.  But you mentioned the 2 rapid-fire keep alive timeout logs for
the same controller as being seriously broken. Perhaps that is another problem?
Maybe keep alives aren't getting stopped correctly or something... 

But:  I'll try this patch and run for a few hours and see what happens.  I
believe regardless of a keep alive issue, the above patch is still needed.

Steve.