* [PATCHv2 0/8] nvme timeout fixes v2
@ 2018-05-22 22:03 Keith Busch
2018-05-22 22:03 ` [PATCHv2 1/8] nvme: Sync request queues on reset Keith Busch
` (8 more replies)
0 siblings, 9 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
While some substantial changes to blk-mq's timeout handling are under
consideration, the following should be compatible with the existing
implementation as well as with the current proposals for the future.
v1 -> v2:
Reverse the sync/disable sequence on nvme reset in case a request was
prevented from completing in the timeout work, then unconditionally
disable the device to reclaim any remaining outstanding tags.
Ensure we're not incorrectly clearing the new queue freeze flag.
Do not start IO queues within the CONNECTING state. The queues will be
started after entering the LIVE state.
Fixed the rate limit so it applies to the intended print.
A new fix for a very unlikely race where an IO blocked from completing
prior to a reset times out yet again in the CONNECTING state.
A fix for a failed controller recovery that could leave queues quiesced
when trying to unbind the driver.
Whitespace fixes
Keith Busch (8):
nvme: Sync request queues on reset
nvme-pci: Fix queue freeze criteria on reset
nvme: Move all IO out of controller reset
nvme: Allow reset from CONNECTING state
nvme-pci: Attempt reset retry for IO failures
nvme-pci: Rate limit the nvme timeout warnings
nvme-pci: End IO requests immediately in CONNECTING state
nvme-pci: Ensure queues are not quiesced on dead controller
drivers/nvme/host/core.c | 24 +++++++++-
drivers/nvme/host/nvme.h | 2 +
drivers/nvme/host/pci.c | 117 +++++++++++++++++++++++++++++++++--------------
3 files changed, 106 insertions(+), 37 deletions(-)
--
2.14.3
* [PATCHv2 1/8] nvme: Sync request queues on reset
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 2/8] nvme-pci: Fix queue freeze criteria " Keith Busch
` (7 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
This patch fixes races that occur with simultaneous controller
resets by synchronizing request queues prior to initializing the
controller. Without this, a thread may attempt to disable a controller
at the same time as we're trying to enable it.
After the queues are synced, the driver's reset work will attempt to
disable the device, as this reclaims any outstanding tags in the
event a request was locked out from completing in a previous reset
attempt.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
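The resulting entry sequence of the reset work boils down to the following
(a simplified sketch of the pci.c hunk below, not the literal code):

	/* Wait out any timeout work still running against our queues. */
	nvme_sync_queues(&dev->ctrl);

	/*
	 * Disable unconditionally: harmless if the device is already
	 * disabled, and it reclaims tags for requests a prior reset
	 * attempt could not complete.
	 */
	nvme_dev_disable(dev, false);
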
drivers/nvme/host/core.c | 21 +++++++++++++++++++--
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 21 +++++++++++++--------
3 files changed, 33 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99b857e5a7a9..1de68b56b318 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3471,6 +3471,12 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
}
EXPORT_SYMBOL_GPL(nvme_init_ctrl);
+static void nvme_start_queue(struct nvme_ns *ns)
+{
+ blk_mq_unquiesce_queue(ns->queue);
+ blk_mq_kick_requeue_list(ns->queue);
+}
+
/**
* nvme_kill_queues(): Ends all namespace queues
* @ctrl: the dead controller that needs to end
@@ -3499,7 +3505,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
blk_set_queue_dying(ns->queue);
/* Forcibly unquiesce queues to avoid blocking dispatch */
- blk_mq_unquiesce_queue(ns->queue);
+ nvme_start_queue(ns);
}
up_read(&ctrl->namespaces_rwsem);
}
@@ -3569,11 +3575,22 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
down_read(&ctrl->namespaces_rwsem);
list_for_each_entry(ns, &ctrl->namespaces, list)
- blk_mq_unquiesce_queue(ns->queue);
+ nvme_start_queue(ns);
up_read(&ctrl->namespaces_rwsem);
}
EXPORT_SYMBOL_GPL(nvme_start_queues);
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+ struct nvme_ns *ns;
+
+ down_read(&ctrl->namespaces_rwsem);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ blk_sync_queue(ns->queue);
+ up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
{
if (!ctrl->ops->reinit_request)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 17d2f7cf3fed..c15c2ee7f61a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
union nvme_result *res);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
void nvme_stop_queues(struct nvme_ctrl *ctrl);
void nvme_start_queues(struct nvme_ctrl *ctrl);
void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..deb44c27cdaf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2217,16 +2217,18 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
nvme_stop_queues(&dev->ctrl);
- if (!dead && dev->ctrl.queue_count > 0) {
+ if (dev->ctrl.queue_count > 0) {
/*
* If the controller is still alive tell it to stop using the
* host memory buffer. In theory the shutdown / reset should
* make sure that it doesn't access the host memoery anymore,
* but I'd rather be safe than sorry..
*/
- if (dev->host_mem_descs)
- nvme_set_host_mem(dev, 0);
- nvme_disable_io_queues(dev);
+ if (!dead) {
+ if (dev->host_mem_descs)
+ nvme_set_host_mem(dev, 0);
+ nvme_disable_io_queues(dev);
+ }
nvme_disable_admin_queue(dev, shutdown);
}
for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
@@ -2307,11 +2309,14 @@ static void nvme_reset_work(struct work_struct *work)
goto out;
/*
- * If we're called to reset a live controller first shut it down before
- * moving on.
+ * Ensure there are no timeout work in progress prior to forcefully
+ * disabling the queue. There is no harm in disabling the device even
+ * when it was already disabled, as this will forcefully reclaim any
+ * IOs that are stuck due to blk-mq's timeout handling that prevents
+ * timed out requests from completing.
*/
- if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
- nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
+ nvme_dev_disable(dev, false);
/*
* Introduce CONNECTING state from nvme-fc/rdma transports to mark the
--
2.14.3
* [PATCHv2 2/8] nvme-pci: Fix queue freeze criteria on reset
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
2018-05-22 22:03 ` [PATCHv2 1/8] nvme: Sync request queues on reset Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 3/8] nvme: Move all IO out of controller reset Keith Busch
` (6 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
The driver had been relying on the pci_dev to maintain the state of the
PCI device in order to know when starting a freeze would be appropriate.
blktests block/011, however, shows that users may alter the state of the
pci_dev out from under drivers and break the criteria we had been using.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
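The new criteria amount to the following (a simplified sketch of the hunk
below): freeze and judge "dead" from the controller's own config/status and
the PCI channel state rather than from pci_is_enabled() alone, since user
space can flip the latter out from under the driver.

	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE &&
	    (dev->ctrl.state == NVME_CTRL_LIVE ||
	     dev->ctrl.state == NVME_CTRL_RESETTING)) {
		u32 csts = readl(dev->bar + NVME_REG_CSTS);

		nvme_start_freeze(&dev->ctrl);
		dead = (csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
		       pci_channel_offline(pdev) || !pci_is_enabled(pdev);
	}
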
drivers/nvme/host/pci.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index deb44c27cdaf..8949ea3609d8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2196,24 +2196,22 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
struct pci_dev *pdev = to_pci_dev(dev->dev);
mutex_lock(&dev->shutdown_lock);
- if (pci_is_enabled(pdev)) {
+ if (dev->ctrl.ctrl_config & NVME_CC_ENABLE &&
+ (dev->ctrl.state == NVME_CTRL_LIVE ||
+ dev->ctrl.state == NVME_CTRL_RESETTING)) {
u32 csts = readl(dev->bar + NVME_REG_CSTS);
- if (dev->ctrl.state == NVME_CTRL_LIVE ||
- dev->ctrl.state == NVME_CTRL_RESETTING)
- nvme_start_freeze(&dev->ctrl);
+ nvme_start_freeze(&dev->ctrl);
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
- pdev->error_state != pci_channel_io_normal);
+ pci_channel_offline(pdev) || !pci_is_enabled(pdev));
}
/*
* Give the controller a chance to complete all entered requests if
* doing a safe shutdown.
*/
- if (!dead) {
- if (shutdown)
- nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
- }
+ if (!dead && shutdown)
+ nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
nvme_stop_queues(&dev->ctrl);
--
2.14.3
* [PATCHv2 3/8] nvme: Move all IO out of controller reset
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
2018-05-22 22:03 ` [PATCHv2 1/8] nvme: Sync request queues on reset Keith Busch
2018-05-22 22:03 ` [PATCHv2 2/8] nvme-pci: Fix queue freeze criteria " Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 4/8] nvme: Allow reset from CONNECTING state Keith Busch
` (5 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
IOs may be retryable, so don't wait for them in the reset path. These
commands may trigger a reset if they expire without a completion, which
places them on the requeue list. Waiting for them would then deadlock
the reset handler.
To fix this theoretical deadlock, this patch still unblocks IO submission
from the reset_work when we enter the LIVE state, but moves the waiting
into the IO-safe scan_work so that the reset_work may proceed to
completion. Since the unfreezing now happens in the controller LIVE state,
the nvme device has to track whether the queues were frozen in order to
prevent incorrect freeze depths.
This patch also renames the function 'nvme_dev_add' to a more
appropriate name that describes what it actually does:
nvme_alloc_io_tags.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
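The resulting flow, spread across the core.c and pci.c hunks below, is
roughly this (simplified sketch; locking and error handling omitted):

	/*
	 * reset_work no longer waits on the frozen queues; it only records
	 * that a freeze was started (dev->queues_froze) and goes LIVE.
	 * scan_work runs in an IO-safe context, so the wait moves there:
	 */
	if (ctrl->ops->update_hw_ctx)
		ctrl->ops->update_hw_ctx(ctrl);	/* nvme_pci_update_hw_ctx on PCIe */

	/* nvme_pci_update_hw_ctx(), simplified: */
	if (dev->queues_froze) {
		nvme_wait_freeze(&dev->ctrl);
		blk_mq_update_nr_hw_queues(ctrl->tagset, dev->online_queues - 1);
		nvme_free_queues(dev, dev->online_queues);
		nvme_unfreeze(&dev->ctrl);
		dev->queues_froze = false;	/* under shutdown_lock in the patch */
	}
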
drivers/nvme/host/core.c | 2 ++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 46 +++++++++++++++++++++++++++++++---------------
3 files changed, 34 insertions(+), 15 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1de68b56b318..1f34f1169dbd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3177,6 +3177,8 @@ static void nvme_scan_work(struct work_struct *work)
struct nvme_id_ctrl *id;
unsigned nn;
+ if (ctrl->ops->update_hw_ctx)
+ ctrl->ops->update_hw_ctx(ctrl);
if (ctrl->state != NVME_CTRL_LIVE)
return;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c15c2ee7f61a..230c5424b197 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -320,6 +320,7 @@ struct nvme_ctrl_ops {
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
int (*reinit_request)(void *data, struct request *rq);
void (*stop_ctrl)(struct nvme_ctrl *ctrl);
+ void (*update_hw_ctx)(struct nvme_ctrl *ctrl);
};
#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8949ea3609d8..5de8d79bad20 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -99,6 +99,7 @@ struct nvme_dev {
u32 cmbloc;
struct nvme_ctrl ctrl;
struct completion ioq_wait;
+ bool queues_froze;
/* shadow doorbell buffer support: */
u32 *dbbuf_dbs;
@@ -2065,10 +2066,32 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
}
}
+static void nvme_pci_update_hw_ctx(struct nvme_ctrl *ctrl)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ bool unfreeze;
+
+ mutex_lock(&dev->shutdown_lock);
+ unfreeze = dev->queues_froze;
+ mutex_unlock(&dev->shutdown_lock);
+
+ if (!unfreeze)
+ return;
+
+ nvme_wait_freeze(&dev->ctrl);
+ blk_mq_update_nr_hw_queues(ctrl->tagset, dev->online_queues - 1);
+ nvme_free_queues(dev, dev->online_queues);
+ nvme_unfreeze(&dev->ctrl);
+
+ mutex_lock(&dev->shutdown_lock);
+ dev->queues_froze = false;
+ mutex_unlock(&dev->shutdown_lock);
+}
+
/*
* return error value only when tagset allocation failed
*/
-static int nvme_dev_add(struct nvme_dev *dev)
+static int nvme_alloc_io_tags(struct nvme_dev *dev)
{
int ret;
@@ -2096,13 +2119,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
dev->ctrl.tagset = &dev->tagset;
nvme_dbbuf_set(dev);
- } else {
- blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
-
- /* Free previously allocated queues that are no longer usable */
- nvme_free_queues(dev, dev->online_queues);
}
-
return 0;
}
@@ -2201,7 +2218,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
dev->ctrl.state == NVME_CTRL_RESETTING)) {
u32 csts = readl(dev->bar + NVME_REG_CSTS);
- nvme_start_freeze(&dev->ctrl);
+ if (!dev->queues_froze) {
+ nvme_start_freeze(&dev->ctrl);
+ dev->queues_froze = true;
+ }
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
pci_channel_offline(pdev) || !pci_is_enabled(pdev));
}
@@ -2379,13 +2399,8 @@ static void nvme_reset_work(struct work_struct *work)
nvme_kill_queues(&dev->ctrl);
nvme_remove_namespaces(&dev->ctrl);
new_state = NVME_CTRL_ADMIN_ONLY;
- } else {
- nvme_start_queues(&dev->ctrl);
- nvme_wait_freeze(&dev->ctrl);
- /* hit this only when allocate tagset fails */
- if (nvme_dev_add(dev))
- new_state = NVME_CTRL_ADMIN_ONLY;
- nvme_unfreeze(&dev->ctrl);
+ } else if (nvme_alloc_io_tags(dev)) {
+ new_state = NVME_CTRL_ADMIN_ONLY;
}
/*
@@ -2450,6 +2465,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
+ .update_hw_ctx = nvme_pci_update_hw_ctx,
.get_address = nvme_pci_get_address,
};
--
2.14.3
* [PATCHv2 4/8] nvme: Allow reset from CONNECTING state
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
` (2 preceding siblings ...)
2018-05-22 22:03 ` [PATCHv2 3/8] nvme: Move all IO out of controller reset Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 5/8] nvme-pci: Attempt reset retry for IO failures Keith Busch
` (4 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
A failed connection may be retryable. This patch allows the connecting
state to initiate a reset so that it may try to connect again.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
drivers/nvme/host/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1f34f1169dbd..87a19687be32 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -292,6 +292,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
case NVME_CTRL_ADMIN_ONLY:
+ case NVME_CTRL_CONNECTING:
changed = true;
/* FALLTHRU */
default:
--
2.14.3
* [PATCHv2 5/8] nvme-pci: Attempt reset retry for IO failures
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
` (3 preceding siblings ...)
2018-05-22 22:03 ` [PATCHv2 4/8] nvme: Allow reset from CONNECTING state Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 6/8] nvme-pci: Rate limit the nvme timeout warnings Keith Busch
` (3 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
If a reset fails due to a non-fatal error, this patch attempts to reset
the controller again, up to a maximum of 4 attempts.
Since the failed-reset case has changed purpose, this patch also provides
a more appropriate function name and warning message for the reset failure.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
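The retry decision boils down to the following (a simplified sketch of the
nvme_reset_failure() hunk below):

	dev->reset_failures++;
	switch (status) {
	case -EIO:	/* IO error */
	case -EINTR:	/* interrupted call */
		/* Possibly transient: schedule another reset, up to 4 tries. */
		if (dev->reset_failures < MAX_RESET_FAILURES &&
		    !nvme_reset_ctrl(&dev->ctrl))
			return;
		break;
	default:
		break;
	}
	/* Otherwise fall through to tearing down the dead controller. */
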
drivers/nvme/host/pci.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5de8d79bad20..dfaf22ff6328 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -37,6 +37,8 @@
#define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define MAX_RESET_FAILURES 4
+
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0);
@@ -101,6 +103,8 @@ struct nvme_dev {
struct completion ioq_wait;
bool queues_froze;
+ int reset_failures;
+
/* shadow doorbell buffer support: */
u32 *dbbuf_dbs;
dma_addr_t dbbuf_dbs_dma_addr;
@@ -2305,9 +2309,23 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
kfree(dev);
}
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+static void nvme_reset_failure(struct nvme_dev *dev, int status)
{
- dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
+ dev->reset_failures++;
+ dev_warn(dev->ctrl.device, "Reset failure status: %d, failures:%d\n",
+ status, dev->reset_failures);
+
+ /* IO and Interrupted Call may indicate a retryable error */
+ switch (status) {
+ case -EIO:
+ case -EINTR:
+ if (dev->reset_failures < MAX_RESET_FAILURES &&
+ !nvme_reset_ctrl(&dev->ctrl))
+ return;
+ break;
+ default:
+ break;
+ }
nvme_get_ctrl(&dev->ctrl);
nvme_dev_disable(dev, false);
@@ -2410,14 +2428,16 @@ static void nvme_reset_work(struct work_struct *work)
if (!nvme_change_ctrl_state(&dev->ctrl, new_state)) {
dev_warn(dev->ctrl.device,
"failed to mark controller state %d\n", new_state);
+ result = -ENODEV;
goto out;
}
+ dev->reset_failures = 0;
nvme_start_ctrl(&dev->ctrl);
return;
out:
- nvme_remove_dead_ctrl(dev, result);
+ nvme_reset_failure(dev, result);
}
static void nvme_remove_dead_ctrl_work(struct work_struct *work)
--
2.14.3
* [PATCHv2 6/8] nvme-pci: Rate limit the nvme timeout warnings
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
` (4 preceding siblings ...)
2018-05-22 22:03 ` [PATCHv2 5/8] nvme-pci: Attempt reset retry for IO failures Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 7/8] nvme-pci: End IO requests in CONNECTING state Keith Busch
` (2 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
The block layer's timeout handling currently doesn't let the driver
complete commands outside the timeout callback once blk-mq decides they've
expired. If a device breaks, this could create many thousands of timed-out
commands, each logging a timeout warning. There's nothing of value to be
gleaned from observing each of those messages, so this patch adds a rate
limit on them.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index dfaf22ff6328..d6576934476a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1235,7 +1235,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
switch (dev->ctrl.state) {
case NVME_CTRL_CONNECTING:
case NVME_CTRL_RESETTING:
- dev_warn(dev->ctrl.device,
+ dev_warn_ratelimited(dev->ctrl.device,
"I/O %d QID %d timeout, disable controller\n",
req->tag, nvmeq->qid);
nvme_dev_disable(dev, false);
--
2.14.3
* [PATCHv2 7/8] nvme-pci: End IO requests in CONNECTING state
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
` (5 preceding siblings ...)
2018-05-22 22:03 ` [PATCHv2 6/8] nvme-pci: Rate limit the nvme timeout warnings Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-22 22:03 ` [PATCHv2 8/8] nvme-pci: Unquiesce queues on dead controller Keith Busch
2018-05-23 3:00 ` [PATCHv2 0/8] nvme timeout fixes v2 Ming Lei
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
This patch catches the unlikely possibility that the block layer doesn't
complete a request the driver has already completed during the CONNECTING
state of a controller reset.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
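The check amounts to the following (a simplified sketch of the
nvme_timeout() hunk below):

	case NVME_CTRL_CONNECTING:
		/*
		 * No IO is dispatched on IO queues in this state, so a
		 * timeout on qid > 0 means blk-mq missed a completion the
		 * driver already requested; just mark it handled.
		 */
		if (nvmeq->qid)
			return BLK_EH_HANDLED;
		/* admin commands fall through to the RESETTING handling */
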
drivers/nvme/host/pci.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d6576934476a..84eeac683322 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1234,6 +1234,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
*/
switch (dev->ctrl.state) {
case NVME_CTRL_CONNECTING:
+ /*
+ * IO is never dispatched from the connecting state. If an IO
+ * queue timed out here, the block layer missed the completion
+ * the driver already requested, so return handled.
+ */
+ if (nvmeq->qid)
+ return BLK_EH_HANDLED;
+ /* FALLTHRU */
case NVME_CTRL_RESETTING:
dev_warn_ratelimited(dev->ctrl.device,
"I/O %d QID %d timeout, disable controller\n",
--
2.14.3
* [PATCHv2 8/8] nvme-pci: Unquiesce queues on dead controller
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
` (6 preceding siblings ...)
2018-05-22 22:03 ` [PATCHv2 7/8] nvme-pci: End IO requests in CONNECTING state Keith Busch
@ 2018-05-22 22:03 ` Keith Busch
2018-05-23 3:00 ` [PATCHv2 0/8] nvme timeout fixes v2 Ming Lei
8 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-22 22:03 UTC (permalink / raw)
This patch ensures the nvme namespace request queues are not left quiesced
on a surprise removal. It's possible the queues were previously killed
in a failed reset, so the queues need to be unquiesced to ensure all
requests are flushed to completion. This is accomplished by setting the
nvme_dev_disable 'shutdown' parameter to true, indicating we're not
going to restart the controller.
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 84eeac683322..452d2199b767 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2638,7 +2638,7 @@ static void nvme_remove(struct pci_dev *pdev)
if (!pci_device_is_present(pdev)) {
nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, true);
}
flush_work(&dev->ctrl.reset_work);
--
2.14.3
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-22 22:03 [PATCHv2 0/8] nvme timeout fixes v2 Keith Busch
` (7 preceding siblings ...)
2018-05-22 22:03 ` [PATCHv2 8/8] nvme-pci: Unquiesce queues on dead controller Keith Busch
@ 2018-05-23 3:00 ` Ming Lei
2018-05-23 16:16 ` Keith Busch
8 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2018-05-23 3:00 UTC (permalink / raw)
On Tue, May 22, 2018@04:03:24PM -0600, Keith Busch wrote:
> While some substantial changes to the blk-mq's timeout handling are
> under consideration, the following should be compatible with existing
> implementation, and current proposals for the future.
>
> v1 -> v2:
>
> Reverse the sync/disable sequence on nvme reset in case a request was
> prevented from completing in the timeout work, then unconditionally
> disable the device to reclaim any remaining outstanding tags.
>
> Ensure we're not incorrectly clearing the new queue freeze flag.
>
> Do not start IO queues within the CONNECTING state. The queues will be
> started after entering the LIVE state.
>
> Fixed the ratelimit to the intended print.
>
> A new fix for a very unlikely race where an IO was blocked from
> completing prior to a reset, timing out yet again in CONNECTING state.
>
> A fix for a failed controller recovery that could leave queues quiesced
> when trying to unbind the driver.
>
> Whitespace fixes
>
> Keith Busch (8):
> nvme: Sync request queues on reset
> nvme-pci: Fix queue freeze criteria on reset
> nvme: Move all IO out of controller reset
> nvme: Allow reset from CONNECTING state
> nvme-pci: Attempt reset retry for IO failures
> nvme-pci: Rate limit the nvme timeout warnings
> nvme-pci: End IO requests immediately in CONNECTING state
> nvme-pci: Ensure queues are not quiesced on dead controller
>
> drivers/nvme/host/core.c | 24 +++++++++-
> drivers/nvme/host/nvme.h | 2 +
> drivers/nvme/host/pci.c | 117 +++++++++++++++++++++++++++++++++--------------
> 3 files changed, 106 insertions(+), 37 deletions(-)
Hi Keith,
Looks V2 still may trigger IO hang warning:
[ 246.812170] INFO: task kworker/u16:12:493 blocked for more than 120 seconds.
[ 246.813483] Tainted: G W 4.17.0-rc5.mp_bvec_v5+ #508
[ 246.814545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.816350] kworker/u16:12 D 0 493 2 0x80000000
[ 246.816359] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 246.816362] Call Trace:
[ 246.816368] ? __schedule+0x660/0x764
[ 246.816370] ? __accumulate_pelt_segments+0x29/0x3a
[ 246.816372] schedule+0x88/0x9b
[ 246.816375] blk_mq_freeze_queue_wait+0x5a/0x9d
[ 246.816377] ? wait_woken+0x6d/0x6d
[ 246.816382] nvme_wait_freeze+0x2d/0x3f [nvme_core]
[ 246.816385] nvme_pci_update_hw_ctx+0x36/0xa6 [nvme]
[ 246.816389] nvme_scan_work+0x42/0x234 [nvme_core]
[ 246.816392] ? check_preempt_curr+0x2a/0x63
[ 246.816394] ? ttwu_do_wakeup.isra.18+0x19/0x134
[ 246.816395] ? _raw_spin_unlock_irqrestore+0x20/0x31
[ 246.816397] ? try_to_wake_up+0x311/0x39e
[ 246.816399] process_one_work+0x18f/0x2e6
[ 246.816402] worker_thread+0x1e3/0x2ab
[ 246.816404] ? rescuer_thread+0x293/0x293
[ 246.816406] kthread+0x113/0x11b
[ 246.816407] ? kthread_create_on_node+0x62/0x62
[ 246.816409] ret_from_fork+0x35/0x40
Thanks,
Ming
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-23 3:00 ` [PATCHv2 0/8] nvme timeout fixes v2 Ming Lei
@ 2018-05-23 16:16 ` Keith Busch
2018-05-23 22:49 ` Keith Busch
2018-05-24 3:23 ` Ming Lei
0 siblings, 2 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-23 16:16 UTC (permalink / raw)
On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
> Looks V2 still may trigger IO hang warning:
Anything in particular that triggered this?
I'm running IO with io-timeout fault injection, with resets and rescans
randomly running in the background, so I'm guessing that's not enough.
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-23 16:16 ` Keith Busch
@ 2018-05-23 22:49 ` Keith Busch
2018-05-24 6:52 ` jianchao.wang
2018-07-11 23:23 ` James Smart
2018-05-24 3:23 ` Ming Lei
1 sibling, 2 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-23 22:49 UTC (permalink / raw)
On Wed, May 23, 2018@10:16:14AM -0600, Keith Busch wrote:
> On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
> > Looks V2 still may trigger IO hang warning:
>
> Anything in particular that triggered this?
>
> I'm running IO with io-timeout fault injection, with resets and rescans
> randomly running in the background, so I'm guessing that's not enough.
At the very least, I can see allowing CONNECTING->RESETTING transitions
was a terrible idea. I can get other failures because of that, but not
quite what you're seeing.
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-23 16:16 ` Keith Busch
2018-05-23 22:49 ` Keith Busch
@ 2018-05-24 3:23 ` Ming Lei
2018-05-24 13:57 ` Keith Busch
1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2018-05-24 3:23 UTC (permalink / raw)
On Wed, May 23, 2018@10:16:14AM -0600, Keith Busch wrote:
> On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
> > Looks V2 still may trigger IO hang warning:
>
> Anything in particular that triggered this?
I am running the modified block 011:
diff --git a/tests/block/011 b/tests/block/011
index 62e89f758ef1..5f71f8b9aca0 100755
--- a/tests/block/011
+++ b/tests/block/011
@@ -44,10 +44,10 @@ test_device() {
--ignore_error=EIO,ENXIO,ENODEV &
while kill -0 $! 2>/dev/null; do
- echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
- sleep .2
- echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
- sleep .2
+ setpci -s "${pdev}" 4.w=00:04
+ sleep .01
+ setpci -s "${pdev}" 4.w=04:04
+ sleep .01
>
> I'm running IO with io-timeout fault injection, with resets and rescans
> randomly running in the background, so I'm guessing that's not enough.
io-timeout can't trigger an admin IO timeout. I just found that yesterday's
hang follows a 'Trying to free already-free' warning, so it should be
related to the admin queue. I haven't taken a close look at your V2 yet,
but it seems you don't address the issues handled by the following patches:
https://marc.info/?l=linux-block&m=152644343805986&w=2
https://marc.info/?l=linux-block&m=152644344805995&w=2
https://marc.info/?l=linux-block&m=152644346006002&w=2
Also, today I tried to reproduce yesterday's report but haven't succeeded
yet, since the following two oopses are triggered instead; the first one
is a bit easier to trigger.
[1] kernfs oops
[ming at VM]$sudo ./check block/011
block/011 => nvme0n1 (disable PCI device while doing I/O)
runtime ...
[ 348.175967] BUG: unable to handle kernel paging request at 0000000001000000
[ 348.176645] PGD 8000000269378067 P4D 8000000269378067 PUD 25b8b8067 PMD 0
[ 348.177274] Oops: 0000 [#1] PREEMPT SMP PTI
[ 348.177674] Dumping ftrace buffer:
[ 348.178010] (ftrace buffer empty)
[ 348.178363] Modules linked in: ebtable_filter ebtables ip6table_filter ip6_tables target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc iptable_filter fuse ip_tables nbd linear sd_mod sg crc32c_intel usb_storage ahci nvme shpchp libahci libata lpc_ich nvme_core virtio_scsi mfd_core binfmt_misc qemu_fw_cfg dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
[ 348.182224] CPU: 0 PID: 8912 Comm: setpci Tainted: G W 4.17.0-rc5.nvme_timeout_keith_V2+ #515
[ 348.183103] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
[ 348.183867] RIP: 0010:prefetch_freepointer.isra.11+0xc/0x14
[ 348.184388] RSP: 0018:ffffc900079f3c88 EFLAGS: 00010206
[ 348.184877] RAX: 0000000000000000 RBX: 0000000001000000 RCX: 0000000203178000
[ 348.185528] RDX: 0000000203176000 RSI: 0000000001000000 RDI: ffff880107c02c20
[ 348.186281] RBP: ffff880107c02c00 R08: 00000000000260c0 R09: ffff880274489035
[ 348.187182] R10: ffff8802728661b8 R11: 8080808080808080 R12: 00000000014080c0
[ 348.188265] R13: 0000000000001000 R14: ffff8802624dc000 R15: ffff880107c02c00
[ 348.189260] FS: 00007fab8d40d740(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
[ 348.190561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 348.191404] CR2: 0000000001000000 CR3: 00000002770e2003 CR4: 00000000003606f0
[ 348.192425] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 348.193432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 348.194489] Call Trace:
[ 348.195012] kmem_cache_alloc_trace+0xd7/0x173
[ 348.195782] ? kernfs_iop_get_link+0x3d/0x18c
[ 348.196455] ? __kernfs_create_file+0xa0/0xa0
[ 348.197241] kernfs_iop_get_link+0x3d/0x18c
[ 348.198241] ? __kernfs_create_file+0xa0/0xa0
[ 348.199088] link_path_walk+0x352/0x42e
[ 348.199768] ? path_init+0x116/0x29c
[ 348.200607] path_openat+0x214/0x286
[ 348.201475] ? memcg_kmem_put_cache+0x4e/0x60
[ 348.202672] do_filp_open+0x5c/0xc6
[ 348.203734] ? do_sys_open+0x14c/0x1e8
[ 348.204417] do_sys_open+0x14c/0x1e8
[ 348.205088] do_syscall_64+0x7e/0x139
[ 348.205805] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 348.206623] RIP: 0033:0x7fab8cd40790
[ 348.207251] RSP: 002b:00007fffde4b3008 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[ 348.208535] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fab8cd40790
[ 348.209489] RDX: 00007fffde4b3048 RSI: 0000000000000000 RDI: 00007fffde4b3020
[ 348.210519] RBP: 0000000000943140 R08: 0000000000000000 R09: 0000000000000028
[ 348.211564] R10: 0000000000000006 R11: 0000000000000246 R12: 00007fffde4b3020
[ 348.212691] R13: 00007fffde4b3460 R14: 000000000093a010 R15: 000000000093a010
[ 348.213787] Code: 89 fb 31 c9 ba 01 00 00 00 e8 b2 f6 ff ff 48 89 ee 48 89 df 31 c9 5b 5d 31 d2 e9 a1 f6 ff ff 0f 1f 44 00 00 48 85 f6 74 09 8b 07 <48> 8b 04 06 0f 18 08 c3 0f 1f 44 00 00 31 c0 c3 0f 1f 44 00 00
[ 348.217009] RIP: prefetch_freepointer.isra.11+0xc/0x14 RSP: ffffc900079f3c88
[ 348.218418] CR2: 0000000001000000
[ 348.219386] ---[ end trace 344142f7875d69d0 ]---
[2]
[ming at VM]$sudo ./check block/011
[sudo] password for ming:
block/011 => nvme0n1 (disable PCI device while doing I/O)
runtime ...
[ 172.667675] kernel BUG at drivers/pci/msi.c:352!
[ 172.669332] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 172.678379] Dumping ftrace buffer:
[ 172.679644] (ftrace buffer empty)
[ 172.682009] Modules linked in: ebtable_filter ebtables ip6table_filter ip6_tables target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc iptable_filter ip_tables fuse nbd linear sd_mod sg usb_storage crc32c_intel ahci libahci shpchp libata lpc_ich mfd_core nvme virtio_scsi nvme_core binfmt_misc qemu_fw_cfg dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
[ 172.701034] CPU: 0 PID: 99 Comm: kworker/0:1H Tainted: G W 4.17.0-rc5.nvme_timeout_keith_V2+ #515
[ 172.703785] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
[ 172.706073] Workqueue: kblockd blk_mq_timeout_work
[ 172.709777] RIP: 0010:free_msi_irqs+0x4c/0x155
[ 172.711275] RSP: 0018:ffffc90001a73be8 EFLAGS: 00010282
[ 172.712635] RAX: ffff88025b755800 RBX: ffff88024c118a20 RCX: 0000000000000000
[ 172.713904] RDX: ffff88025b755800 RSI: ffffc90001a73bb0 RDI: ffff880107800248
[ 172.715160] RBP: ffff88027700a000 R08: ffff8801078003e8 R09: ffff880107800248
[ 172.719793] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 172.721568] R13: ffff88027700a298 R14: 0000000000000000 R15: ffff880276842000
[ 172.723004] FS: 0000000000000000(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
[ 172.725161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 172.731642] CR2: 00007fe1071b1c90 CR3: 000000000200a001 CR4: 00000000003606f0
[ 172.734676] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 172.738747] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 172.740079] Call Trace:
[ 172.740727] pci_disable_msix+0xda/0xf8
[ 172.742008] pci_free_irq_vectors+0xe/0x17
[ 172.742910] nvme_dev_disable+0x3b2/0x434 [nvme]
[ 172.743899] ? dev_warn+0x64/0x80
[ 172.744760] nvme_timeout+0x13c/0x2c9 [nvme]
[ 172.745733] ? _raw_spin_unlock_irq+0x1d/0x2e
[ 172.746733] ? wait_for_common+0x120/0x187
[ 172.747745] ? wake_up_q+0x4d/0x4d
[ 172.748625] blk_mq_terminate_expired+0x45/0x84
[ 172.749540] bt_for_each+0xcb/0xfd
[ 172.750308] ? blk_mq_requeue_request+0x53/0x53
[ 172.753817] ? blk_mq_requeue_request+0x53/0x53
[ 172.754723] blk_mq_queue_tag_busy_iter+0x80/0x8d
[ 172.755771] blk_mq_timeout_work+0x132/0x1d2
[ 172.757335] process_one_work+0x18f/0x2e6
[ 172.758182] worker_thread+0x1e3/0x2ab
[ 172.758855] ? rescuer_thread+0x293/0x293
[ 172.759528] kthread+0x113/0x11b
[ 172.760131] ? kthread_create_on_node+0x62/0x62
[ 172.760852] ret_from_fork+0x35/0x40
Thanks,
Ming
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-23 22:49 ` Keith Busch
@ 2018-05-24 6:52 ` jianchao.wang
2018-07-11 23:23 ` James Smart
1 sibling, 0 replies; 18+ messages in thread
From: jianchao.wang @ 2018-05-24 6:52 UTC (permalink / raw)
Hi Keith and Ming
On 05/24/2018 06:49 AM, Keith Busch wrote:
> On Wed, May 23, 2018@10:16:14AM -0600, Keith Busch wrote:
>> On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
>>> Looks V2 still may trigger IO hang warning:
>>
>> Anything in particular that triggered this?
>>
>> I'm running IO with io-timeout fault injection, with resets and rescans
>> randomly running in the background, so I'm guessing that's not enough.
>
> At the very least, I can see allowing CONNECTING->RESETTING transitions
> was a terrible idea. I can get other failures because of that, but not
> quite what you're seeing.
>
I reproduced this issue locally with a simulated io timeout and a script
that binds and unbinds the nvme card repeatedly.
The dmesg log:
[ 1898.968131] nvme nvme0: Reset failure status: -4, failures:1
[ 1898.988273] queue 8 cq_vector is -1
[ 1898.990369] queue 7 cq_vector is -1
[ 1898.992463] queue 6 cq_vector is -1
[ 1898.993761] queue 5 cq_vector is -1
[ 1898.994264] queue 4 cq_vector is -1
[ 1898.994766] queue 3 cq_vector is -1
[ 1898.995269] queue 2 cq_vector is -1
[ 1898.995801] queue 1 cq_vector is -1
[ 1898.996343] queue 0 cq_vector is -1
This is printed by nvme_dev_disable; I added the print myself.
From the timing, io 137 on ioq 4 should have timed out while waiting in
nvme_pci_update_hw_ctx->nvme_wait_freeze:
[ 1929.836296] nvme nvme0: I/O 137 QID 4 timeout, aborting
[ 1929.838932] nvme nvme0: Abort status: 0x0
[ 1959.916465] nvme nvme0: I/O 137 QID 4 timeout, reset controller
[ 2040.796290] INFO: task kworker/u16:2:2231 blocked for more than 120 seconds.
[ 2040.799178] Not tainted 4.17.0-rc5+ #72
[ 2040.801582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2040.802347] kworker/u16:2 D 0 2231 2 0x80000080
[ 2040.803076] Workqueue: nvme-wq nvme_scan_work
[ 2040.803815] Call Trace:
[ 2040.807052] schedule+0x52/0xe0
[ 2040.807854] blk_mq_freeze_queue_wait+0xad/0x150
[ 2040.810548] nvme_wait_freeze+0x42/0x60
[ 2040.811445] nvme_pci_update_hw_ctx+0x7c/0x150
[ 2040.812407] nvme_scan_work+0xb0/0x3f0
[ 2040.821562] process_one_work+0x3ca/0xaa0
[ 2040.824848] worker_thread+0x89/0x6c0
[ 2040.826050] kthread+0x18d/0x1e0
[ 2040.829603] ret_from_fork+0x24/0x30
And some debugfs and sysfs information:
root at will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|REGISTERED|QUIESCED
root at will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx3/cpu3/rq_list
00000000e78405ab {.op=READ, .cmd_flags=, .rq_flags=DONTPREP|IO_STAT|STATS|MQ_TIMEOUT_EXPIRED, .state=idle, .tag=137, .internal_tag=-1}
root at will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat /sys/block/nvme0n1/device/state
deleting
I suspect that when nvme_timeout handles io 137 and queues the reset_work,
nvme_remove cancels the reset work at the same time. The nvme queues are
then left in a quiesced state.
Thanks
Jianchao
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-24 3:23 ` Ming Lei
@ 2018-05-24 13:57 ` Keith Busch
2018-05-24 15:04 ` Ming Lei
0 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2018-05-24 13:57 UTC (permalink / raw)
On Thu, May 24, 2018@11:23:07AM +0800, Ming Lei wrote:
> On Wed, May 23, 2018@10:16:14AM -0600, Keith Busch wrote:
> > On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
> > > Looks V2 still may trigger IO hang warning:
> >
> > Anything in particular that triggered this?
>
> I am running the modified block 011:
>
> diff --git a/tests/block/011 b/tests/block/011
> index 62e89f758ef1..5f71f8b9aca0 100755
> --- a/tests/block/011
> +++ b/tests/block/011
> @@ -44,10 +44,10 @@ test_device() {
> --ignore_error=EIO,ENXIO,ENODEV &
>
> while kill -0 $! 2>/dev/null; do
> - echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
> - sleep .2
> - echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
> - sleep .2
> + setpci -s "${pdev}" 4.w=00:04
> + sleep .01
> + setpci -s "${pdev}" 4.w=04:04
> + sleep .01
Okay, fair enough. Easy enough to fix.
Just a reality check here: what is your expectation for this test? Any
driver should just give up on a device truly behaving this way.
> io-timeout can't trigger an admin IO timeout. I just found that yesterday's
> hang follows a 'Trying to free already-free' warning, so it should be
> related to the admin queue. I haven't taken a close look at your V2 yet,
> but it seems you don't address the issues handled by the following patches:
>
> https://marc.info/?l=linux-block&m=152644343805986&w=2
> https://marc.info/?l=linux-block&m=152644344805995&w=2
> https://marc.info/?l=linux-block&m=152644346006002&w=2
Right, I had said those patches were fine. I wasn't trying to duplicate
them here, but I'll be sure to do so in the next round.
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-24 13:57 ` Keith Busch
@ 2018-05-24 15:04 ` Ming Lei
2018-05-24 15:16 ` Keith Busch
0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2018-05-24 15:04 UTC (permalink / raw)
On Thu, May 24, 2018 at 9:57 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Thu, May 24, 2018@11:23:07AM +0800, Ming Lei wrote:
>> On Wed, May 23, 2018@10:16:14AM -0600, Keith Busch wrote:
>> > On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
>> > > Looks V2 still may trigger IO hang warning:
>> >
>> > Anything in particular that triggered this?
>>
>> I am running the modified block 011:
>>
>> diff --git a/tests/block/011 b/tests/block/011
>> index 62e89f758ef1..5f71f8b9aca0 100755
>> --- a/tests/block/011
>> +++ b/tests/block/011
>> @@ -44,10 +44,10 @@ test_device() {
>> --ignore_error=EIO,ENXIO,ENODEV &
>>
>> while kill -0 $! 2>/dev/null; do
>> - echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
>> - sleep .2
>> - echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
>> - sleep .2
>> + setpci -s "${pdev}" 4.w=00:04
>> + sleep .01
>> + setpci -s "${pdev}" 4.w=04:04
>> + sleep .01
>
> Okay, fair enough. Easy enough to fix.
>
> Just a reality check here: what is your expectation for this test? Any
> driver should just give up on a device truly behaving this way.
No IO hang, no oops, and either pass or fail should be fine.
Do you think this is reasonable?
Thanks,
Ming Lei
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-24 15:04 ` Ming Lei
@ 2018-05-24 15:16 ` Keith Busch
0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2018-05-24 15:16 UTC (permalink / raw)
On Thu, May 24, 2018@11:04:29PM +0800, Ming Lei wrote:
> > Okay, fair enough. Easy enough to fix.
> >
> > Just a reality check here: what is your expectation for this test? Any
> > driver should just give up on a device truly behaving this way.
>
> No IO hang, no oops, and either pass or fail should be fine.
>
> Do you think this is reasonable?
Great, that is definitely reasonable.
* [PATCHv2 0/8] nvme timeout fixes v2
2018-05-23 22:49 ` Keith Busch
2018-05-24 6:52 ` jianchao.wang
@ 2018-07-11 23:23 ` James Smart
1 sibling, 0 replies; 18+ messages in thread
From: James Smart @ 2018-07-11 23:23 UTC (permalink / raw)
On 5/23/2018 3:49 PM, Keith Busch wrote:
> On Wed, May 23, 2018@10:16:14AM -0600, Keith Busch wrote:
>> On Wed, May 23, 2018@11:00:59AM +0800, Ming Lei wrote:
>>> Looks V2 still may trigger IO hang warning:
>> Anything in particular that triggered this?
>>
>> I'm running IO with io-timeout fault injection, with resets and rescans
>> randomly running in the background, so I'm guessing that's not enough.
> At the very least, I can see allowing CONNECTING->RESETTING transitions
> was a terrible idea. I can get other failures because of that, but not
> quite what you're seeing.
Is there another way to resolve these terrible ideas?
I can see this transition as being very useful for the case where I'm
in CONNECTING but one of the commands to init the controller gets an io
error or timeout (e.g. a fabric Connect command gets lost), thus I need to
"reset" (kill all io for the current association) and restart CONNECTING.
With the transition, I can use the singular code path that is there
already. Without it, I have to duplicate the reset entry point so I can
call it, perform the same work, yet stay in the CONNECTING state - which
seems an odd thing to do.
Note: this may be related to the timeout handler. In my case, I always
return BLK_EH_RESET_TIMER, as the best the timeout handler can do is
initiate the hw action to abort the io so that it will finish sometime
in the very near future, and request the controller to reset/change
state. The reset never finishes until all ios complete, which means it
waits for those aborted ios to come back from the hw.
-- james