* [PATCH] nvme: allow queues the chance to quiesce after freezing them
@ 2015-11-19 19:11 Jon Derrick
2015-11-19 20:53 ` Jon Derrick
2015-11-19 21:41 ` Keith Busch
From: Jon Derrick @ 2015-11-19 19:11 UTC (permalink / raw)
A panic was discovered while doing I/O and triggering the sysfs reset.
Because I/O was completing successfully, the nvme_dev_shutdown code
detected this non-idle state as a stuck state and started tearing down
the queues. This resulted in a paging error when nvme_process_cq wrote
the doorbell of a deleted queue.
This patch allows some time after starting the queue freeze for queues
to quiesce on their own. It also adds a new nvme_queue member, frozen,
to prevent writing the cq doorbell. If the queues successfully quiesce,
nvme_process_cq will run upon resuming. If the queues don't quiesce,
the existing code considers the controller dead and tears it down.
Signed-off-by: Jon Derrick <jonathan.derrick at intel.com>
---
block/blk-mq.c | 8 ++++++++
drivers/block/nvme-core.c | 33 ++++++++++++++++++++++++++++++++-
include/linux/blk-mq.h | 1 +
3 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85f0143..b7fa323 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -128,6 +128,14 @@ static void blk_mq_freeze_queue_wait(struct request_queue *q)
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
}
+long blk_mq_freeze_queue_wait_timeout(struct request_queue *q, long timeout)
+{
+ return wait_event_timeout(q->mq_freeze_wq,
+ percpu_ref_is_zero(&q->mq_usage_counter),
+ timeout);
+}
+EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
+
/*
* Guarantee no request is in use, so we can change any data structure of
* the queue afterward.
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ccc0c1f..183a868 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -122,6 +122,7 @@ struct nvme_queue {
u8 cq_phase;
u8 cqe_seen;
struct async_cmd_info cmdinfo;
+ bool frozen;
};
/*
@@ -977,7 +978,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
return 0;
- writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+ if (unlikely(!nvmeq->frozen))
+ writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
nvmeq->cq_head = head;
nvmeq->cq_phase = phase;
@@ -2774,6 +2776,8 @@ static void nvme_dev_list_remove(struct nvme_dev *dev)
static void nvme_freeze_queues(struct nvme_dev *dev)
{
struct nvme_ns *ns;
+ unsigned long start, timeout;
+ int i;
list_for_each_entry(ns, &dev->namespaces, list) {
blk_mq_freeze_queue_start(ns->queue);
@@ -2785,11 +2789,28 @@ static void nvme_freeze_queues(struct nvme_dev *dev)
blk_mq_cancel_requeue_work(ns->queue);
blk_mq_stop_hw_queues(ns->queue);
}
+
+ for (i = 1; i < dev->queue_count; i++) {
+ struct nvme_queue *nvmeq = dev->queues[i];
+ if (!nvmeq)
+ continue;
+ nvmeq->frozen = true;
+ }
+
+ start = jiffies;
+ list_for_each_entry(ns, &dev->namespaces, list) {
+ timeout = ns->queue->tag_set->timeout;
+ if (time_after_eq(jiffies, start + timeout))
+ timeout = 0;
+ blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
+ }
}
static void nvme_unfreeze_queues(struct nvme_dev *dev)
{
struct nvme_ns *ns;
+ unsigned long flags;
+ int i;
list_for_each_entry(ns, &dev->namespaces, list) {
queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, ns->queue);
@@ -2797,6 +2818,16 @@ static void nvme_unfreeze_queues(struct nvme_dev *dev)
blk_mq_start_stopped_hw_queues(ns->queue, true);
blk_mq_kick_requeue_list(ns->queue);
}
+
+ for (i = 1; i < dev->queue_count; i++) {
+ struct nvme_queue *nvmeq = dev->queues[i];
+ if (!nvmeq)
+ continue;
+ nvmeq->frozen = false;
+ spin_lock_irqsave(&nvmeq->q_lock, flags);
+ nvme_process_cq(nvmeq);
+ spin_unlock_irqrestore(&nvmeq->q_lock, flags);
+ }
}
static void nvme_dev_shutdown(struct nvme_dev *dev)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5e7d43a..3741e59 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -228,6 +228,7 @@ void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_unfreeze_queue(struct request_queue *q);
void blk_mq_freeze_queue_start(struct request_queue *q);
+long blk_mq_freeze_queue_wait_timeout(struct request_queue *q, long timeout);
/*
* Driver command data is immediately after the request. So subtract request
--
2.5.0
* [PATCH] nvme: allow queues the chance to quiesce after freezing them
2015-11-19 19:11 [PATCH] nvme: allow queues the chance to quiesce after freezing them Jon Derrick
@ 2015-11-19 20:53 ` Jon Derrick
2015-11-19 21:41 ` Keith Busch
From: Jon Derrick @ 2015-11-19 20:53 UTC (permalink / raw)
> - writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
> + if (unlikely(!nvmeq->frozen))
> + writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
Oops... this last-minute addition should have been a likely().
* [PATCH] nvme: allow queues the chance to quiesce after freezing them
2015-11-19 19:11 [PATCH] nvme: allow queues the chance to quiesce after freezing them Jon Derrick
2015-11-19 20:53 ` Jon Derrick
@ 2015-11-19 21:41 ` Keith Busch
2015-11-19 21:53 ` Christoph Hellwig
2015-11-19 22:12 ` Jon Derrick
From: Keith Busch @ 2015-11-19 21:41 UTC (permalink / raw)
On Thu, Nov 19, 2015 at 12:11:52PM -0700, Jon Derrick wrote:
> A panic was discovered while doing io and hitting the sysfs reset.
> Because io was completing successfully, the nvme_dev_shutdown code
> detected this non-idle state as a stuck state and started to tear down
> the queues. This resulted in a paging error when nvme_process_cq wrote
> the doorbell of a deleted queue.
>
> This patch allows some time after starting the queue freeze for queues
> to quiesce on their own. It also sets a new nvme_queue member, frozen,
> to prevent writing of the cq doorbell. If the queues successfully
> quiesce, nvme_process_cq will run upon resuming. If the queues don't
> quiesce, existing code considers it a dead controller and is torn down.
I think all we really want is to skip notifying completions on a
"suspended" queue. We can tell by the value of the cq_vector,
and it's already lock protected.
It also sounds like we need to poll the cq after the delete completes
to catch successful completions before we force cancel the rest.
This appears to work for me. Does it pass your test?
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 394fd16..930042fa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -968,7 +968,8 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
return;
- writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+ if (likely(nvmeq->cq_vector >= 0))
+ writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
nvmeq->cq_head = head;
nvmeq->cq_phase = phase;
@@ -2787,6 +2788,10 @@ static void nvme_del_queue_end(struct nvme_queue *nvmeq)
{
struct nvme_delq_ctx *dq = nvmeq->cmdinfo.ctx;
nvme_put_dq(dq);
+
+ spin_lock_irq(&nvmeq->q_lock);
+ nvme_process_cq(nvmeq);
+ spin_unlock_irq(&nvmeq->q_lock);
}
static int adapter_async_del_queue(struct nvme_queue *nvmeq, u8 opcode,
--
* [PATCH] nvme: allow queues the chance to quiesce after freezing them
2015-11-19 21:41 ` Keith Busch
@ 2015-11-19 21:53 ` Christoph Hellwig
2015-11-19 22:20 ` Keith Busch
2015-11-19 22:12 ` Jon Derrick
From: Christoph Hellwig @ 2015-11-19 21:53 UTC (permalink / raw)
On Thu, Nov 19, 2015 at 09:41:57PM +0000, Keith Busch wrote:
> On Thu, Nov 19, 2015 at 12:11:52PM -0700, Jon Derrick wrote:
> > A panic was discovered while doing io and hitting the sysfs reset.
> > Because io was completing successfully, the nvme_dev_shutdown code
> > detected this non-idle state as a stuck state and started to tear down
> > the queues. This resulted in a paging error when nvme_process_cq wrote
> > the doorbell of a deleted queue.
> >
> > This patch allows some time after starting the queue freeze for queues
> > to quiesce on their own. It also sets a new nvme_queue member, frozen,
> > to prevent writing of the cq doorbell. If the queues successfully
> > quiesce, nvme_process_cq will run upon resuming. If the queues don't
> > quiesce, existing code considers it a dead controller and is torn down.
>
> I think all we really want is skip notifying completions on a
> "suspended" queue. We can tell by the value of the cq-vector,
> and it's already lock protected.
>
> It also sounds like we need to poll the cq after the delete completes
> to catch successful completions before we force cancel the rest.
>
> This appears to work for me. Does it pass your test?
This looks reasonable. I ran into stray ->q_db dereferences a lot during
reset testing, but after my abort and reset rewrites ([1] for the latest
version) I couldn't reproduce it any more.
[1]
http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-req.8
* [PATCH] nvme: allow queues the chance to quiesce after freezing them
2015-11-19 21:41 ` Keith Busch
2015-11-19 21:53 ` Christoph Hellwig
@ 2015-11-19 22:12 ` Jon Derrick
From: Jon Derrick @ 2015-11-19 22:12 UTC (permalink / raw)
> I think all we really want is skip notifying completions on a
> "suspended" queue. We can tell by the value of the cq-vector,
> and it's already lock protected.
>
Good catch. I wish I had made that connection :)
(Though I was hoping somebody would figure out a less hacky way of doing it than I did.)
> It also sounds like we need to poll the cq after the delete completes
> to catch successful completions before we force cancel the rest.
>
> This appears to work for me. Does it pass your test?
Passes my tests as well, so you can add my tested-by:
Tested-by: Jon Derrick <jonathan.derrick at intel.com>
Re: Christoph
> This looks reasonable. I ran into stray ->q_db dereferences a lot during
> reset testing, but after my abort and reset rewrites ([1] for the latest
> version) I couldn't reproduce it any more.
Sorry, this was against 4.3 (and was reproducible on 4.4-rc1). I will give your repo a test shortly.
* [PATCH] nvme: allow queues the chance to quiesce after freezing them
2015-11-19 21:53 ` Christoph Hellwig
@ 2015-11-19 22:20 ` Keith Busch
2015-11-19 22:24 ` Christoph Hellwig
From: Keith Busch @ 2015-11-19 22:20 UTC (permalink / raw)
On Thu, Nov 19, 2015 at 01:53:11PM -0800, Christoph Hellwig wrote:
> This looks reasonable. I ran into stray ->q_db derference a lot during
> reset testing, but after my abort and reset rewrites ([1] for th latest
> version) I couldn't reproduce it any more.
>
> [1]
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-req.8
There's good stuff in this tree. I've been using it most of this week
to study the reset handling and still have a concern here.
I mentioned ending I/O in the reset handler in a previous review, and I can
create memory and data corruption with well-timed delays (or badly timed,
depending on your perspective). We can't end I/O while the controller
owns it. Even though hitting the timeout handler often means the
controller is not going to return the command anyway, ending it releases
memory the controller may still access.
I don't see how we can disable the controller to forcefully end commands
with merged probe and reset work, and I'm thinking these need to be kept
separate. Timeouts are done in a workqueue now, so can we do controller
shutdown inline with the timeout handler?
* [PATCH] nvme: allow queues the chance to quiesce after freezing them
2015-11-19 22:20 ` Keith Busch
@ 2015-11-19 22:24 ` Christoph Hellwig
From: Christoph Hellwig @ 2015-11-19 22:24 UTC (permalink / raw)
On Thu, Nov 19, 2015 at 10:20:12PM +0000, Keith Busch wrote:
> I don't see how we can disable the controller to forcefully end commands
> with merged probe and reset work, and thinking these need to be kept
> separate. Timeouts are done in a workqueue now, so can we do controller
> shutdown inline with the timeout handler?
That's one of the two options I'm exploring. The other is a new request
flag that tells the block layer not to reuse the tag after a return from
the timeout handler.