From mboxrd@z Thu Jan 1 00:00:00 1970 From: swise@opengridcomputing.com (Steve Wise) Date: Tue, 1 Nov 2016 11:37:06 -0500 Subject: nvmet_rdma crash - DISCONNECT event with NULL queue In-Reply-To: <4cc25277-429a-4ab9-470c-b3af1428ce93@grimberg.me> References: <01b401d23458$af277210$0d765630$@opengridcomputing.com> <6f42d056-284d-00fc-2b98-189f54957980@grimberg.me> <01cc01d2345b$d445acd0$7cd10670$@opengridcomputing.com> <4cc25277-429a-4ab9-470c-b3af1428ce93@grimberg.me> Message-ID: <01d101d2345e$2f054390$8d0fcab0$@opengridcomputing.com> > > >>> I just hit an nvmf target NULL pointer deref BUG after a few hours of > > keep-alive > >>> timeout testing. It appears that nvmet_rdma_cm_handler() was called with > >>> cm_id->qp == NULL, so the local nvmet_rdma_queue * variable queue is left as > >>> NULL. But then nvmet_rdma_queue_disconnect() is called with queue == NULL > >> which > >>> causes the crash. > >> > >> AFAICT, the only way cm_id->qp is NULL is for a scenario we didn't even > >> get to allocate a queue-pair (e.g. calling rdma_create_qp). The teardown > >> paths does not nullify cm_id->qp... > > > > rdma_destroy_qp() nulls out cm_id->qp. > > pphh, somehow managed to miss it... > > So we have a case where we can call rdma_destroy_qp and > then rdma_destroy_id but still get events on the cm_id... > Not very nice... > > So I think that the patch from Bart a few weeks ago was correct: > Not quite. It just guards against a null queue for TIMEWAIT_EXIT, which is only generated by the IB_CM. > --- > drivers/nvme/target/rdma.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c > index d1aea17..a61e47f 100644 > --- a/drivers/nvme/target/rdma.c > +++ b/drivers/nvme/target/rdma.c > @@ -1354,9 +1354,12 @@ static int nvmet_rdma_cm_handler(struct > rdma_cm_id *cm_id, > break; > case RDMA_CM_EVENT_ADDR_CHANGE: > case RDMA_CM_EVENT_DISCONNECTED: > - case RDMA_CM_EVENT_TIMEWAIT_EXIT: > nvmet_rdma_queue_disconnect(queue); > break; > + case RDMA_CM_EVENT_TIMEWAIT_EXIT: > + if (queue) > + nvmet_rdma_queue_disconnect(queue); > + break; > case RDMA_CM_EVENT_DEVICE_REMOVAL: > ret = nvmet_rdma_device_removal(cm_id, queue); > break; > --- > > In case this fixes the issue (as expected) I'll queue it up > with a change log and a code comment on why we need to do > this (and include all the relevant cases around it)...