From mboxrd@z Thu Jan 1 00:00:00 1970 From: swise@opengridcomputing.com (Steve Wise) Date: Tue, 1 Nov 2016 11:49:03 -0500 Subject: nvmet_rdma crash - DISCONNECT event with NULL queue In-Reply-To: References: <01b401d23458$af277210$0d765630$@opengridcomputing.com> <6f42d056-284d-00fc-2b98-189f54957980@grimberg.me> <01cc01d2345b$d445acd0$7cd10670$@opengridcomputing.com> <4cc25277-429a-4ab9-470c-b3af1428ce93@grimberg.me> <01d101d2345e$2f054390$8d0fcab0$@opengridcomputing.com> Message-ID: <01d901d2345f$da0d2e00$8e278a00$@opengridcomputing.com> > >> So I think that the patch from Bart a few weeks ago was correct: > >> > > > > Not quite. It just guards against a null queue for TIMEWAIT_EXIT, which is only > > generated by the IB_CM. > > Yes, this is why we need ADDR_CHANGE and DISCONNECTED too > "(and include all the relevant cases around it)" > > The other events we don't get to LIVE state and we don't have > other error flows that will trigger queue teardown sequence. > > -- > nvmet-rdma: Fix possible NULL deref when handling rdma cm > events > > When we initiate queue teardown sequence we call rdma_destroy_qp > which clears cm_id->qp, afterwards we call rdma_destroy_id, but > we might see a rdma_cm event in between with a cleared cm_id->qp > so watch out for that and silently ignore the event because this > means that the queue teardown sequence is in progress. > > Signed-off-by: Bart Van Assche > Signed-off-by: Sagi Grimberg > --- > drivers/nvme/target/rdma.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c > index b4d648536c3e..240888efd920 100644 > --- a/drivers/nvme/target/rdma.c > +++ b/drivers/nvme/target/rdma.c > @@ -1351,7 +1351,13 @@ static int nvmet_rdma_cm_handler(struct > rdma_cm_id *cm_id, > case RDMA_CM_EVENT_ADDR_CHANGE: > case RDMA_CM_EVENT_DISCONNECTED: > case RDMA_CM_EVENT_TIMEWAIT_EXIT: > - nvmet_rdma_queue_disconnect(queue); > + /* > + * We might end up here when we already freed the qp > + * which means queue release sequence is in progress, > + * so don't get in the way... > + */ > + if (!queue) > + nvmet_rdma_queue_disconnect(queue); > break; > case RDMA_CM_EVENT_DEVICE_REMOVAL: > ret = nvmet_rdma_device_removal(cm_id, queue); > -- This looks good. But you mentioned the 2 rapid-fire keep alive timeout logs for the same controller as being seriously broken. Perhaps that is another problem? Maybe keep alives aren't getting stopped correctly or something... But: I'll try this patch and run for a few hours and see what happens. I believe regardless of a keep alive issue, the above patch is still needed. Steve.