From mboxrd@z Thu Jan 1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Wed, 22 Aug 2018 10:08:25 -0500
Subject: [resend] Crash in nvmet_req_init() - null req->rsp pointer
In-Reply-To: <31a08853-e2a7-709a-9251-e6c64fda22dd@grimberg.me>
References: <1eb5a94a-07ed-1c2a-47fe-cc9a4bf32466@opengridcomputing.com>
 <31a08853-e2a7-709a-9251-e6c64fda22dd@grimberg.me>
Message-ID: <4fb0053a-1493-369e-cac5-a91133ad9499@opengridcomputing.com>

On 8/20/2018 3:47 PM, Sagi Grimberg wrote:
>
>> Resending in plain text...
>>
>> ----
>>
>> Hey guys,
>>
>> I'm debugging an nvmet_rdma crash on the linux-4.14.52 stable kernel
>> code.  Under heavy load, including 80 nvmf devices, after 13 hours of
>> running, I see an Oops [1] when the target is processing a new ingress
>> nvme command.  It crashes in nvmet_req_init() because req->rsp is NULL:
>>
>>    493  bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
>>    494                  struct nvmet_sq *sq, struct nvmet_fabrics_ops *ops)
>>    495  {
>>    496          u8 flags = req->cmd->common.flags;
>>    497          u16 status;
>>    498
>>    499          req->cq = cq;
>>    500          req->sq = sq;
>>    501          req->ops = ops;
>>    502          req->sg = NULL;
>>    503          req->sg_cnt = 0;
>>    504          req->rsp->status = 0;    <-- HERE
>>
>> The nvme command opcode is nvme_cmd_write.  The nvmet_rdma_queue state
>> is NVMET_RDMA_Q_LIVE.  The nvmet_req looks valid [2], i.e. not garbage.
>> But it seems very bad that req->rsp is NULL! :)
>>
>> Any thoughts?  I didn't see anything like this in recent nvmf fixes...
>
> Is it possible that you ran out of rsps and got a corrupted rsp?

That is what it looks like: the nvmet_rdma_rsp pointer returned from
nvmet_rdma_get_rsp() was not within the bounds of the response struct
array allocated for that queue.

I wasn't sure how that could happen.  I would have thought
list_first_entry() would return NULL if the list was empty, but
apparently it doesn't (a short sketch of why is below the quoted patch).

>
> How about trying out this patch to add more information:

I'll try this out.  Thanks Sagi!

> --
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index e7f43d1e1779..890d9c45ca33 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -66,6 +66,7 @@ struct nvmet_rdma_rsp {
>
>         struct nvmet_req        req;
>
> +       bool                    allocated;
>         u8                      n_rdma;
>         u32                     flags;
>         u32                     invalidate_rkey;
> @@ -174,11 +175,20 @@ nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
>         unsigned long flags;
>
>         spin_lock_irqsave(&queue->rsps_lock, flags);
> -       rsp = list_first_entry(&queue->free_rsps,
> +       rsp = list_first_entry_or_null(&queue->free_rsps,
>                                 struct nvmet_rdma_rsp, free_list);
> -       list_del(&rsp->free_list);
> +       if (rsp) {
> +               list_del(&rsp->free_list);
> +               rsp->allocated = false;
> +       }
>         spin_unlock_irqrestore(&queue->rsps_lock, flags);
>
> +       if (!rsp) {
> +               pr_debug("dynamically allocated rsp\n");
> +               rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
> +               rsp->allocated = true;
> +       }
> +
>         return rsp;
> }
>
> @@ -187,6 +197,11 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
> {
>         unsigned long flags;
>
> +       if (rsp->allocated) {
> +               kfree(rsp);
> +               return;
> +       }
> +
>         spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
>         list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
>         spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
> --
>
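
For what it's worth, here is a minimal userspace sketch of why the
empty-list case hands back an out-of-bounds pointer rather than NULL.
The list macros below are simplified re-creations of the kernel ones
(not the real implementations) and the two structs are trimmed
stand-ins, but the pointer arithmetic is the same: list_first_entry()
is just container_of() applied to head->next, and an empty list head
points back at itself, so the "rsp" you get is an address inside the
queue structure.  That is consistent with the pointer landing outside
the rsp array, and it is why the patch switches to
list_first_entry_or_null().

#include <stddef.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

/* simplified re-creations of the kernel macros, for illustration only */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

#define list_first_entry_or_null(head, type, member) \
	((head)->next != (head) ? list_first_entry(head, type, member) : NULL)

/* trimmed stand-ins for the real nvmet_rdma structs */
struct nvmet_rdma_rsp {
	int n_rdma;
	struct list_head free_list;
};

struct nvmet_rdma_queue {
	struct list_head free_rsps;
};

int main(void)
{
	struct nvmet_rdma_queue queue;
	struct nvmet_rdma_rsp *bad, *ok;

	/* an empty list head points back at itself */
	queue.free_rsps.next = queue.free_rsps.prev = &queue.free_rsps;

	/* container_of() on the head itself: a bogus non-NULL pointer */
	bad = list_first_entry(&queue.free_rsps,
			       struct nvmet_rdma_rsp, free_list);

	/* the _or_null variant checks for the empty case first */
	ok = list_first_entry_or_null(&queue.free_rsps,
				      struct nvmet_rdma_rsp, free_list);

	printf("&queue = %p\n", (void *)&queue);
	printf("list_first_entry() on empty list = %p\n", (void *)bad);
	printf("list_first_entry_or_null()       = %p\n", (void *)ok);
	return 0;
}

Building this with any C compiler and running it prints a "rsp" address
sitting just below &queue, i.e. well outside any real rsp array, while
the _or_null variant returns NULL as expected.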