* CQ Doorbells can be touched after queue deleted @ 2014-09-30 18:32 Paul Grabinar 2014-09-30 19:41 ` Keith Busch 0 siblings, 1 reply; 3+ messages in thread From: Paul Grabinar @ 2014-09-30 18:32 UTC (permalink / raw) Hi All, I've encountered an interesting issue with the driver as in v3.17-rc7. The NVMe specification defines writing to CQ doorbells for non-existent queues as "undefined", so it is probably not a good idea to do this. I'm aware of at least one drive that gets very upset if you try. The case I hit was where there is I/O running to the drive, but the drive is being reset in the kthread due to not responding to abort requests. When an I/O request came in, nvme_process_cq was called from nvme_make_request, but the queue no longer exists as it has been torn down by the reset. During nvme_process_cq, the doorbell is updated, which upsets the drive. This is a bit of a corner case, but it has happened. We probably need to skip the doorbell update if the queue has been deleted. ^ permalink raw reply [flat|nested] 3+ messages in thread
* CQ Doorbells can be touched after queue deleted 2014-09-30 18:32 CQ Doorbells can be touched after queue deleted Paul Grabinar @ 2014-09-30 19:41 ` Keith Busch 2014-09-30 19:55 ` Paul Grabinar 0 siblings, 1 reply; 3+ messages in thread From: Keith Busch @ 2014-09-30 19:41 UTC (permalink / raw) On Tue, 30 Sep 2014, Paul Grabinar wrote: > I've encountered an interesting issue with the driver as in v3.17-rc7. > The NVMe specification defines writing to CQ doorbells for non-existent > queues as "undefined", so it is probably not a good idea to do this. > I'm aware of at least one drive that gets very upset if you try. > > The case I hit was where there is I/O running to the drive, but the > drive is being reset in the kthread due to not responding to abort requests. > When an I/O request came in, nvme_process_cq was called from > nvme_make_request, but the queue no longer exists as it has been torn > down by the reset. > During nvme_process_cq, the doorbell is updated, which upsets the drive. > > This is a bit of a corner case, but it has happened. > We probably need to skip the doorbell update if the queue has been deleted. Thanks for the info. I agree the results are undefined in this case. Your situation is interesting, though. The driver won't update the doorbell if no completions were posted, so it sounds like your device is posting completions long after the queues were deleted if the problem occured in the make_request path. Anyway, here's a patch. >From a141c8116d1d8f86fffb58bb312ec4d75cdc82ef Mon Sep 17 00:00:00 2001 From: Keith Busch <keith.busch@intel.com> Date: Tue, 30 Sep 2014 12:51:06 -0600 Subject: [PATCH] NVMe: Skip updating cq doorbell on inactive queues The driver processes the completion queue after taking queues offline to check for new completions that may have occured when the device deleted the queue. We don't want to write the completion queue doorbell in this case as it has undefined results. Signed-off-by: Keith Busch <keith.busch at intel.com> Reported-by: Paul Grabinar <paul.grabinar at ranbarg.com> --- drivers/block/nvme-core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c index 28aec2d..6c077cd 100644 --- a/drivers/block/nvme-core.c +++ b/drivers/block/nvme-core.c @@ -828,7 +828,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq) if (head == nvmeq->cq_head && phase == nvmeq->cq_phase) return 0; - writel(head, nvmeq->q_db + nvmeq->dev->db_stride); + if (!nvmeq->q_suspended) + writel(head, nvmeq->q_db + nvmeq->dev->db_stride); nvmeq->cq_head = head; nvmeq->cq_phase = phase; -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 3+ messages in thread
* CQ Doorbells can be touched after queue deleted 2014-09-30 19:41 ` Keith Busch @ 2014-09-30 19:55 ` Paul Grabinar 0 siblings, 0 replies; 3+ messages in thread From: Paul Grabinar @ 2014-09-30 19:55 UTC (permalink / raw) On 30/09/14 20:41, Keith Busch wrote: > On Tue, 30 Sep 2014, Paul Grabinar wrote: >> I've encountered an interesting issue with the driver as in v3.17-rc7. >> The NVMe specification defines writing to CQ doorbells for non-existent >> queues as "undefined", so it is probably not a good idea to do this. >> I'm aware of at least one drive that gets very upset if you try. >> >> The case I hit was where there is I/O running to the drive, but the >> drive is being reset in the kthread due to not responding to abort >> requests. >> When an I/O request came in, nvme_process_cq was called from >> nvme_make_request, but the queue no longer exists as it has been torn >> down by the reset. >> During nvme_process_cq, the doorbell is updated, which upsets the drive. >> >> This is a bit of a corner case, but it has happened. >> We probably need to skip the doorbell update if the queue has been >> deleted. > > Thanks for the info. I agree the results are undefined in this case. Your > situation is interesting, though. The driver won't update the doorbell > if no completions were posted, so it sounds like your device is posting > completions long after the queues were deleted if the problem occured > in the make_request path. > > Anyway, here's a patch. > >> From a141c8116d1d8f86fffb58bb312ec4d75cdc82ef Mon Sep 17 00:00:00 2001 > From: Keith Busch <keith.busch at intel.com> > Date: Tue, 30 Sep 2014 12:51:06 -0600 > Subject: [PATCH] NVMe: Skip updating cq doorbell on inactive queues > > The driver processes the completion queue after taking queues offline to > check for new completions that may have occured when the device deleted > the queue. We don't want to write the completion queue doorbell in this > case as it has undefined results. > > Signed-off-by: Keith Busch <keith.busch at intel.com> > Reported-by: Paul Grabinar <paul.grabinar at ranbarg.com> > --- > drivers/block/nvme-core.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c > index 28aec2d..6c077cd 100644 > --- a/drivers/block/nvme-core.c > +++ b/drivers/block/nvme-core.c > @@ -828,7 +828,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq) > if (head == nvmeq->cq_head && phase == nvmeq->cq_phase) > return 0; > > - writel(head, nvmeq->q_db + nvmeq->dev->db_stride); > + if (!nvmeq->q_suspended) > + writel(head, nvmeq->q_db + nvmeq->dev->db_stride); > nvmeq->cq_head = head; > nvmeq->cq_phase = phase; > Hi, Yes, good point. I guess there are still commands that are being processed and we are getting late completions. Thanks for the patch. ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2014-09-30 19:55 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-09-30 18:32 CQ Doorbells can be touched after queue deleted Paul Grabinar 2014-09-30 19:41 ` Keith Busch 2014-09-30 19:55 ` Paul Grabinar
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.