From mboxrd@z Thu Jan 1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 8 Sep 2016 14:26:00 -0500
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
In-Reply-To: <01f201d20a05$6abde5f0$4039b1d0$@opengridcomputing.com>
References: <018301d1e9e1$da3b2e40$8eb18ac0$@opengridcomputing.com>
 <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com>
 <013701d1f320$57b185d0$07149170$@opengridcomputing.com>
 <018401d1f32b$792cfdb0$6b86f910$@opengridcomputing.com>
 <01a301d1f339$55ba8e70$012fab50$@opengridcomputing.com>
 <2fb1129c-424d-8b2d-7101-b9471e897dc8@grimberg.me>
 <004701d1f3d8$760660b0$62132210$@opengridcomputing.com>
 <008101d1f3de$557d2850$007778f0$@opengridcomputing.com>
 <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
 <01c301d1f702$d28c7270$77a55750$@opengridcomputing.com>
 <6ef9b0d1-ce84-4598-74db-7adeed313bb6@grimberg.me>
 <045601d1f803$a9d73a20$fd85ae60$@opengridcomputing.com>
 <69c0e819-76d9-286b-c4fb-22f087f36ff1@grimberg.me>
 <08b701d1f8ba$a709ae10$f51d0a30$@opengridcomputing.com>
 <01c301d20485$0dfcd2c0$29f67840$@opengridcomputing.com>
 <0c159abb-24ee-21bf-09d2-9fe7d269a2eb@grimberg.me>
 <039601d2094f$80481640$80d842c0$@opengridcomputing.com>
 <9fd1f090-3b86-b496-d8c0-225ac0815fbe@grimberg.me>
 <01bc01d209f5$1b7d7510$52785f30$@opengridcomputing.com>
 <01f201d20a05$6abde5f0$4039b1d0$@opengridcomputing.com>
Message-ID: <01f301d20a06$d4c53060$7e4f9120$@opengridcomputing.com>

> While working this with debug code to verify that we never create a qp,
> cq, or cm_id where one already exists for an nvme_rdma_queue, I discovered
> a bug where the Q_DELETING flag is never cleared, and thus a reconnect can
> leak qps and cm_ids.
> The fix, I think, is this:
>
> @@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>         int ret;
>
>         queue = &ctrl->queues[idx];
> +       queue->flags = 0;
>         queue->ctrl = ctrl;
>         init_completion(&queue->cm_done);
>
> I think maybe the clearing of the Q_DELETING flag was lost when we moved
> to using the ib_client for device removal. I'll polish this up and
> submit a patch. It should go with the next 4.8-rc push I think.

Actually, I think the Q_DELETING flag is no longer needed. Sagi, can you have
a look at NVME_RDMA_Q_DELETING in the latest code? I think the ib_client patch
made the original Q_DELETING patch obsolete. And the original Q_DELETING patch
probably needed the above chunk for correctness...

Let me know if you want me to submit something for this issue. We could fix
the original patches as they are still only in your nvmf-4.8-rc repo...

Steve.
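
For illustration only, here is a minimal userspace sketch of the failure mode described in the quoted text. It is not the driver code. It assumes that queue teardown is guarded by a "deleting" bit and that the same queue slot is reused on reconnect. The names fake_queue, FAKE_Q_DELETING, fake_init_queue, fake_free_queue and leaked_after_reconnect are all hypothetical, and the allocated/freed counters stand in for the qp and cm_id.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the Q_DELETING flag; not the kernel definition. */
#define FAKE_Q_DELETING (1UL << 0)

/* Hypothetical stand-in for a queue slot that is reused across reconnects. */
struct fake_queue {
	unsigned long flags;
	int allocated;	/* times resources (think qp, cm_id) were created */
	int freed;	/* times resources were actually released */
};

/* Init path: optionally reset the flags, as the one-line fix does. */
static void fake_init_queue(struct fake_queue *q, bool reset_flags)
{
	if (reset_flags)
		q->flags = 0;
	q->allocated++;
}

/* Teardown path: assumed to bail out if the deleting bit is already set. */
static void fake_free_queue(struct fake_queue *q)
{
	if (q->flags & FAKE_Q_DELETING)
		return;		/* stale bit: resources are never released */
	q->flags |= FAKE_Q_DELETING;
	q->freed++;
}

/* Connect, tear down, reconnect on the same slot, tear down again. */
static int leaked_after_reconnect(bool reset_flags)
{
	struct fake_queue q = { 0 };

	fake_init_queue(&q, reset_flags);
	fake_free_queue(&q);
	fake_init_queue(&q, reset_flags);	/* reconnect reuses the slot */
	fake_free_queue(&q);

	return q.allocated - q.freed;
}
```

In this sketch leaked_after_reconnect(false) returns 1, because the second teardown sees the stale bit and skips the release, while leaked_after_reconnect(true) returns 0. If the flag is dropped entirely, as suggested above, the guard and the reset both go away.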