linux-nvme.lists.infradead.org archive mirror
* [PATCH RFC 0/5] NVMe: Hotplug support
@ 2013-12-30 10:27 Santosh Y
  2013-12-30 10:27 ` [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction Santosh Y
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Santosh Y @ 2013-12-30 10:27 UTC (permalink / raw)


This patchset provides hotplug support for NVMe.
For the NVMe hotplug feature to work, the PCIe slot must be hotplug-capable.

Santosh Y (5):
  NVMe: Code cleanup and minor checkpatch correction
  NVMe: Basic NVMe device hotplug support
  NVMe: Asynchronous device scan support
  NVMe: Stale node cleanup based on reference count
  NVMe: Hotplug support during hibernate/sleep states

 drivers/block/Kconfig     |   8 ++
 drivers/block/nvme-core.c | 334 +++++++++++++++++++++++++++++++++++++++++++---
 include/linux/nvme.h      |  15 +++
 3 files changed, 339 insertions(+), 18 deletions(-)

-- 
1.8.3.2


* [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction
  2013-12-30 10:27 [PATCH RFC 0/5] NVMe: Hotplug support Santosh Y
@ 2013-12-30 10:27 ` Santosh Y
  2013-12-30 13:32   ` Matthew Wilcox
  2013-12-30 14:52   ` Keith Busch
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Santosh Y @ 2013-12-30 10:27 UTC (permalink / raw)


Remove the redundant 'dev->node' deletion, which is already handled
in nvme_dev_shutdown(), and fix a minor checkpatch error.

Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
Signed-off-by: Santosh Y <santoshsy at gmail.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b59a93a..a523296 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -2120,7 +2120,7 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
 
 struct nvme_delq_ctx {
 	struct task_struct *waiter;
-	struct kthread_worker* worker;
+	struct kthread_worker *worker;
 	atomic_t refcount;
 };
 
@@ -2556,10 +2556,6 @@ static void nvme_remove(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
 
-	spin_lock(&dev_list_lock);
-	list_del_init(&dev->node);
-	spin_unlock(&dev_list_lock);
-
 	pci_set_drvdata(pdev, NULL);
 	flush_work(&dev->reset_work);
 	misc_deregister(&dev->miscdev);
-- 
1.8.3.2


* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 [PATCH RFC 0/5] NVMe: Hotplug support Santosh Y
  2013-12-30 10:27 ` [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction Santosh Y
@ 2013-12-30 10:27 ` Santosh Y
  2013-12-30 13:46   ` Matthew Wilcox
                     ` (5 more replies)
  2013-12-30 10:27 ` [PATCH RFC 3/5] NVMe: Asynchronous device scan support Santosh Y
                   ` (2 subsequent siblings)
  4 siblings, 6 replies; 23+ messages in thread
From: Santosh Y @ 2013-12-30 10:27 UTC (permalink / raw)


This patch provides basic hotplug support for NVMe.
For NVMe hotplug to work, the PCIe slot must be hotplug-capable.

When an NVMe device is surprise-removed and then inserted back, the
device may need some time before it can respond to host I/O commands,
and will return NVME_SC_NS_NOT_READY in the meantime. In this case
the requests will be requeued until the device completes them with a
status response or until the commands time out.

Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
Signed-off-by: Santosh Y <santoshsy at gmail.com>

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 86b9f37..f92ec96 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -319,6 +319,14 @@ config BLK_DEV_NVME
 	  To compile this driver as a module, choose M here: the
 	  module will be called nvme.
 
+config BLK_DEV_NVME_HP
+	bool "Enable hotplug support"
+	depends on BLK_DEV_NVME && HOTPLUG_PCI_PCIE
+	default n
+	help
+	  If you say Y here, the driver will support hotplug feature.
+	  Hotplug only works if the PCIe slot is hotplug capable.
+
 config BLK_DEV_SKD
 	tristate "STEC S1120 Block Driver"
 	depends on PCI
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index a523296..8a02135 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -172,6 +172,13 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
 	return cmdid;
 }
 
+static inline bool nvme_check_surprise_removal(struct nvme_dev *dev)
+{
+	if (readl(&dev->bar->csts) == -1)
+		return true;
+	return false;
+}
+
 static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
 				nvme_completion_fn handler, unsigned timeout)
 {
@@ -370,6 +377,19 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
 	part_stat_unlock();
 }
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+static void nvme_requeue_bio(struct nvme_dev *dev, struct bio *bio)
+{
+	struct nvme_queue *nvmeq = get_nvmeq(dev);
+	if (bio_list_empty(&nvmeq->sq_cong))
+		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
+	bio_list_add(&nvmeq->sq_cong, bio);
+	put_nvmeq(nvmeq);
+	wake_up_process(nvme_thread);
+}
+#endif
+
+
 static void bio_completion(struct nvme_dev *dev, void *ctx,
 						struct nvme_completion *cqe)
 {
@@ -383,10 +403,19 @@ static void bio_completion(struct nvme_dev *dev, void *ctx,
 		nvme_end_io_acct(bio, iod->start_time);
 	}
 	nvme_free_iod(dev, iod);
-	if (status)
-		bio_endio(bio, -EIO);
-	else
+	if (status) {
+#ifdef CONFIG_BLK_DEV_NVME_HP
+		if ((status & 0xff) == NVME_SC_INVALID_NS) {
+			bio_endio(bio, -ENODEV);
+		} else if ((status & 0xff) == NVME_SC_NS_NOT_READY) {
+			bio->bi_rw |= REQ_FAILFAST_DRIVER;
+			nvme_requeue_bio(dev, bio);
+		} else
+#endif
+			bio_endio(bio, -EIO);
+	} else {
 		bio_endio(bio, 0);
+	}
 }
 
 /* length is in bytes.  gfp flags indicates whether we may sleep. */
@@ -722,6 +751,10 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	if ((bio->bi_rw & REQ_FLUSH) && !psegs)
 		return nvme_submit_flush(nvmeq, ns, cmdid);
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (bio->bi_rw & REQ_FAILFAST_DRIVER)
+		mdelay(100);
+#endif
 	control = 0;
 	if (bio->bi_rw & REQ_FUA)
 		control |= NVME_RW_FUA;
@@ -814,10 +847,26 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 
 static void nvme_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct nvme_ns *ns = q->queuedata;
-	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
+	struct nvme_ns *ns = NULL;
+	struct nvme_queue *nvmeq = NULL;
 	int result = -EBUSY;
 
+	if (likely(q && q->queuedata))
+		ns = q->queuedata;
+	if (unlikely(!ns)) {
+		bio_endio(bio, -ENODEV);
+		return;
+	}
+
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
+		!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
+		bio_endio(bio, -ENODEV);
+		return;
+	}
+#endif
+	nvmeq = get_nvmeq(ns->dev);
+
 	if (!nvmeq) {
 		put_nvmeq(NULL);
 		bio_endio(bio, -EIO);
@@ -1120,6 +1169,12 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
 			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
 		};
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+		if (test_bit(NVME_HOT_REM, &nvmeq->dev->hp_flag)) {
+			cqe.status |= (NVME_SC_INVALID_NS << 1);
+			info[cmdid].timeout = jiffies - NVME_IO_TIMEOUT;
+		}
+#endif
 		if (timeout && !time_after(now, info[cmdid].timeout))
 			continue;
 		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
@@ -1205,7 +1260,7 @@ static void nvme_disable_queue(struct nvme_dev *dev, int qid)
 
 	/* Don't tell the adapter to delete the admin queue.
 	 * Don't tell a removed adapter to delete IO queues. */
-	if (qid && readl(&dev->bar->csts) != -1) {
+	if (qid && !nvme_check_surprise_removal(dev)) {
 		adapter_delete_sq(dev, qid);
 		adapter_delete_cq(dev, qid);
 	}
@@ -1724,6 +1779,13 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
 		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
 		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+		if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
+			!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
+			bio_endio(bio, -ENODEV);
+			continue;
+		}
+#endif
 		if (bio_list_empty(&nvmeq->sq_cong))
 			remove_wait_queue(&nvmeq->sq_full,
 							&nvmeq->sq_cong_wait);
@@ -1746,7 +1808,8 @@ static int nvme_kthread(void *data)
 		spin_lock(&dev_list_lock);
 		list_for_each_entry_safe(dev, next, &dev_list, node) {
 			int i;
-			if (readl(&dev->bar->csts) & NVME_CSTS_CFS &&
+			if (!nvme_check_surprise_removal(dev) &&
+				readl(&dev->bar->csts) & NVME_CSTS_CFS &&
 							dev->initialized) {
 				if (work_busy(&dev->reset_work))
 					continue;
@@ -2082,7 +2145,7 @@ static int nvme_dev_map(struct nvme_dev *dev)
 	dev->bar = ioremap(pci_resource_start(pdev, 0), 8192);
 	if (!dev->bar)
 		goto disable;
-	if (readl(&dev->bar->csts) == -1) {
+	if (nvme_check_surprise_removal(dev)) {
 		result = -ENODEV;
 		goto unmap;
 	}
@@ -2265,7 +2328,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 	list_del_init(&dev->node);
 	spin_unlock(&dev_list_lock);
 
-	if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
+	if (!dev->bar || (dev->bar && nvme_check_surprise_removal(dev))) {
 		for (i = dev->queue_count - 1; i >= 0; i--) {
 			struct nvme_queue *nvmeq = dev->queues[i];
 			nvme_suspend_queue(nvmeq);
@@ -2534,6 +2597,12 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev->initialized = 1;
 	kref_init(&dev->kref);
+
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!pdev->is_added)
+		dev_info(&pdev->dev,
+			"Device 0x%x is on-line\n", pdev->device);
+#endif
 	return 0;
 
  remove:
@@ -2556,6 +2625,16 @@ static void nvme_remove(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!pdev || !dev)
+		return;
+	if (nvme_check_surprise_removal(dev)) {
+		set_bit(NVME_HOT_REM, &dev->hp_flag);
+		dev_info(&pdev->dev,
+			"Surprise removal of device 0x%x\n", pdev->device);
+	}
+	pci_dev_get(pdev);
+#endif
 	pci_set_drvdata(pdev, NULL);
 	flush_work(&dev->reset_work);
 	misc_deregister(&dev->miscdev);
@@ -2565,6 +2644,9 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_release_instance(dev);
 	nvme_release_prp_pools(dev);
 	kref_put(&dev->kref, nvme_free_dev);
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	pci_dev_put(pdev);
+#endif
 }
 
 /* These functions are yet to be implemented */
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 69ae03f..4ef375e 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -68,6 +68,12 @@ enum {
 
 #define NVME_IO_TIMEOUT	(5 * HZ)
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+enum {
+	NVME_HOT_REM,
+};
+#endif
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -97,6 +103,9 @@ struct nvme_dev {
 	u16 oncs;
 	u16 abort_limit;
 	u8 initialized;
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	unsigned long hp_flag;
+#endif
 };
 
 /*
-- 
1.8.3.2


* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2013-12-30 10:27 [PATCH RFC 0/5] NVMe: Hotplug support Santosh Y
  2013-12-30 10:27 ` [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction Santosh Y
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
@ 2013-12-30 10:27 ` Santosh Y
  2013-12-30 13:50   ` Matthew Wilcox
  2013-12-30 10:27 ` [PATCH RFC 4/5] NVMe: Stale node cleanup based on reference count Santosh Y
  2013-12-30 10:27 ` [PATCH RFC 5/5] NVMe: Hotplug support during hibernate/sleep states Santosh Y
  4 siblings, 1 reply; 23+ messages in thread
From: Santosh Y @ 2013-12-30 10:27 UTC (permalink / raw)


This patch provides asynchronous device enumeration capability.
The 'probe' does not need to wait until the namespace scanning is
complete.

Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
Signed-off-by: Santosh Y <santoshsy at gmail.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 8a02135..cd37335 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -17,6 +17,9 @@
  */
 
 #include <linux/nvme.h>
+#ifdef CONFIG_BLK_DEV_NVME_HP
+#include <linux/async.h>
+#endif
 #include <linux/bio.h>
 #include <linux/bitops.h>
 #include <linux/blkdev.h>
@@ -2115,8 +2118,10 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (ns)
 			list_add_tail(&ns->list, &dev->namespaces);
 	}
+#ifndef CONFIG_BLK_DEV_NVME_HP
 	list_for_each_entry(ns, &dev->namespaces, list)
 		add_disk(ns->disk);
+#endif
 	res = 0;
 
  out:
@@ -2546,6 +2551,19 @@ static void nvme_reset_failed_dev(struct work_struct *ws)
 	nvme_dev_reset(dev);
 }
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+static void nvme_async_add(void *data, async_cookie_t cookie)
+{
+	struct nvme_dev *dev = (struct nvme_dev *)data;
+	struct nvme_ns *ns;
+
+	list_for_each_entry(ns, &dev->namespaces, list)
+		add_disk(ns->disk);
+	if (!test_bit(NVME_HOT_REM, &dev->hp_flag))
+		dev->initialized = 1;
+}
+#endif
+
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	int result = -ENOMEM;
@@ -2595,14 +2613,16 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto remove;
 
-	dev->initialized = 1;
-	kref_init(&dev->kref);
-
 #ifdef CONFIG_BLK_DEV_NVME_HP
+	async_schedule(nvme_async_add, dev);
 	if (!pdev->is_added)
 		dev_info(&pdev->dev,
 			"Device 0x%x is on-line\n", pdev->device);
+#else
+	dev->initialized = 1;
 #endif
+	kref_init(&dev->kref);
+
 	return 0;
 
  remove:
-- 
1.8.3.2


* [PATCH RFC 4/5] NVMe: Stale node cleanup based on reference count
  2013-12-30 10:27 [PATCH RFC 0/5] NVMe: Hotplug support Santosh Y
                   ` (2 preceding siblings ...)
  2013-12-30 10:27 ` [PATCH RFC 3/5] NVMe: Asynchronous device scan support Santosh Y
@ 2013-12-30 10:27 ` Santosh Y
  2013-12-30 14:00   ` Matthew Wilcox
  2013-12-30 10:27 ` [PATCH RFC 5/5] NVMe: Hotplug support during hibernate/sleep states Santosh Y
  4 siblings, 1 reply; 23+ messages in thread
From: Santosh Y @ 2013-12-30 10:27 UTC (permalink / raw)


This patch maintains a reference count for namespaces that are still
being referenced by user applications during surprise removal.
Once the reference count drops to zero, the instance release and
cleanup are handled.

Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
Signed-off-by: Santosh Y <santoshsy at gmail.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index cd37335..48698b7 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -51,6 +51,10 @@
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
 #define ADMIN_TIMEOUT	(60 * HZ)
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+#define NVME_MINORS 64
+#endif
+
 static int nvme_major;
 module_param(nvme_major, int, 0);
 
@@ -62,6 +66,11 @@ static LIST_HEAD(dev_list);
 static struct task_struct *nvme_thread;
 static struct workqueue_struct *nvme_workq;
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+static DEFINE_SPINLOCK(stalen_lock);
+static LIST_HEAD(stalen_list);
+#endif
+
 static void nvme_reset_failed_dev(struct work_struct *ws);
 
 struct async_cmd_info {
@@ -1737,6 +1746,11 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
 {
 	struct nvme_ns *ns = bdev->bd_disk->private_data;
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!ns || (test_bit(NVME_HOT_REM, &ns->dev->hp_flag)) ||
+		!bdev->bd_disk || !(bdev->bd_disk->flags & GENHD_FL_UP))
+		return -ENODEV;
+#endif
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
@@ -1770,8 +1784,42 @@ static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
 #define nvme_compat_ioctl	NULL
 #endif
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+static int nvme_bd_open(struct block_device *bdev, fmode_t mode)
+{
+	struct nvme_ns *ns;
+	int err = -ENODEV;
+
+	if (!bdev || !bdev->bd_disk ||
+		!bdev->bd_disk->private_data)
+		goto out;
+	if ((bdev->bd_disk->flags & GENHD_FL_UP)) {
+		ns = (struct nvme_ns *)bdev->bd_disk->private_data;
+		atomic_inc(&ns->refcount);
+		err = 0;
+	}
+out:
+	return err;
+}
+static void nvme_bd_release(struct gendisk *gdisk, fmode_t mode)
+{
+	struct nvme_ns *ns;
+
+	if (!gdisk || !gdisk->private_data)
+		goto out;
+	ns = (struct nvme_ns *)gdisk->private_data;
+	atomic_dec(&ns->refcount);
+out:
+	return;
+}
+#endif
+
 static const struct block_device_operations nvme_fops = {
 	.owner		= THIS_MODULE,
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	.open           = nvme_bd_open,
+	.release        = nvme_bd_release,
+#endif
 	.ioctl		= nvme_ioctl,
 	.compat_ioctl	= nvme_compat_ioctl,
 };
@@ -1805,6 +1853,9 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
 static int nvme_kthread(void *data)
 {
 	struct nvme_dev *dev, *next;
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	unsigned long flags  = 0;
+#endif
 
 	while (!kthread_should_stop()) {
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -1827,14 +1878,22 @@ static int nvme_kthread(void *data)
 				struct nvme_queue *nvmeq = dev->queues[i];
 				if (!nvmeq)
 					continue;
+#ifdef CONFIG_BLK_DEV_NVME_HP
+				spin_lock_irqsave(&nvmeq->q_lock, flags);
+#else
 				spin_lock_irq(&nvmeq->q_lock);
+#endif
 				if (nvmeq->q_suspended)
 					goto unlock;
 				nvme_process_cq(nvmeq);
 				nvme_cancel_ios(nvmeq, true);
 				nvme_resubmit_bios(nvmeq);
  unlock:
+#ifdef CONFIG_BLK_DEV_NVME_HP
+				spin_unlock_irqrestore(&nvmeq->q_lock, flags);
+#else
 				spin_unlock_irq(&nvmeq->q_lock);
+#endif
 			}
 		}
 		spin_unlock(&dev_list_lock);
@@ -1843,6 +1902,33 @@ static int nvme_kthread(void *data)
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+static DEFINE_IDA(nvme_index_ida);
+
+static int nvme_get_ns_idx(void)
+{
+	int index, error;
+	do {
+		if (!ida_pre_get(&nvme_index_ida, GFP_KERNEL))
+			return -1;
+		spin_lock(&dev_list_lock);
+		error = ida_get_new(&nvme_index_ida, &index);
+		spin_unlock(&dev_list_lock);
+	} while (error == -EAGAIN);
+
+	if (error)
+		index = -1;
+	return index;
+}
+
+static void nvme_put_ns_idx(int index)
+{
+	spin_lock(&dev_list_lock);
+	ida_remove(&nvme_index_ida, index);
+	spin_unlock(&dev_list_lock);
+}
+#endif
+
 static void nvme_config_discard(struct nvme_ns *ns)
 {
 	u32 logical_block_size = queue_logical_block_size(ns->queue);
@@ -1876,7 +1962,11 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	ns->dev = dev;
 	ns->queue->queuedata = ns;
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	disk = alloc_disk(NVME_MINORS);
+#else
 	disk = alloc_disk(0);
+#endif
 	if (!disk)
 		goto out_free_queue;
 	ns->ns_id = nsid;
@@ -1889,12 +1979,19 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 		blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);
 
 	disk->major = nvme_major;
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	disk->minors = NVME_MINORS;
+	disk->first_minor = NVME_MINORS * nvme_get_ns_idx();
+#else
 	disk->first_minor = 0;
+#endif
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
 	disk->queue = ns->queue;
 	disk->driverfs_dev = &dev->pci_dev->dev;
+#ifndef CONFIG_BLK_DEV_NVME_HP
 	disk->flags = GENHD_FL_EXT_DEVT;
+#endif
 	sprintf(disk->disk_name, "nvme%dn%d", dev->instance, nsid);
 	set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
 
@@ -1912,7 +2009,13 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 
 static void nvme_ns_free(struct nvme_ns *ns)
 {
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	int index = ns->disk->first_minor / NVME_MINORS;
+#endif
 	put_disk(ns->disk);
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	nvme_put_ns_idx(index);
+#endif
 	blk_cleanup_queue(ns->queue);
 	kfree(ns);
 }
@@ -2352,10 +2455,27 @@ static void nvme_dev_remove(struct nvme_dev *dev)
 	struct nvme_ns *ns, *next;
 
 	list_for_each_entry_safe(ns, next, &dev->namespaces, list) {
-		list_del(&ns->list);
-		del_gendisk(ns->disk);
+		if (ns->disk->flags & GENHD_FL_UP)
+			del_gendisk(ns->disk);
+#ifdef CONFIG_BLK_DEV_NVME_HP
+		if (!(atomic_read(&ns->refcount))) {
+			list_del(&ns->list);
+			nvme_ns_free(ns);
+		} else {
+			set_bit(NVME_STALE_NODE, &dev->hp_flag);
+		}
+#else
 		nvme_ns_free(ns);
+#endif
 	}
+
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (test_bit(NVME_STALE_NODE, &dev->hp_flag)) {
+		spin_lock(&stalen_lock);
+		list_add(&dev->stale_node, &stalen_list);
+		spin_unlock(&stalen_lock);
+	}
+#endif
 }
 
 static int nvme_setup_prp_pools(struct nvme_dev *dev)
@@ -2411,12 +2531,45 @@ static void nvme_release_instance(struct nvme_dev *dev)
 	spin_unlock(&dev_list_lock);
 }
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+static void nvme_remove_stalen(void)
+{
+	struct nvme_ns *ns, *next;
+	struct nvme_dev *dev, *dev_next;
+	int ns_count = 0, ns_free_count = 0;
+
+	list_for_each_entry_safe(dev, dev_next, &stalen_list, stale_node) {
+		list_for_each_entry_safe(ns, next, &dev->namespaces, list) {
+			++ns_count;
+			if (ns && (!atomic_read(&ns->refcount))) {
+				list_del_init(&ns->list);
+				nvme_ns_free(ns);
+				++ns_free_count;
+			}
+		}
+
+		if (ns_count == ns_free_count)
+			clear_bit(NVME_STALE_NODE, &dev->hp_flag);
+		if (!test_bit(NVME_STALE_NODE, &dev->hp_flag)) {
+			spin_lock(&stalen_lock);
+			list_del(&dev->stale_node);
+			spin_unlock(&stalen_lock);
+			nvme_release_instance(dev);
+			kfree(dev);
+		}
+	}
+}
+#endif
+
 static void nvme_free_dev(struct kref *kref)
 {
 	struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);
 	kfree(dev->queues);
 	kfree(dev->entry);
-	kfree(dev);
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!test_bit(NVME_STALE_NODE, &dev->hp_flag))
+#endif
+		kfree(dev);
 }
 
 static int nvme_dev_open(struct inode *inode, struct file *f)
@@ -2431,6 +2584,11 @@ static int nvme_dev_open(struct inode *inode, struct file *f)
 static int nvme_dev_release(struct inode *inode, struct file *f)
 {
 	struct nvme_dev *dev = f->private_data;
+
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!dev)
+		return -ENODEV;
+#endif
 	kref_put(&dev->kref, nvme_free_dev);
 	return 0;
 }
@@ -2438,6 +2596,11 @@ static int nvme_dev_release(struct inode *inode, struct file *f)
 static long nvme_dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
 	struct nvme_dev *dev = f->private_data;
+
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!dev)
+		return -ENODEV;
+#endif
 	switch (cmd) {
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_admin_cmd(dev, (void __user *)arg);
@@ -2569,6 +2732,9 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int result = -ENOMEM;
 	struct nvme_dev *dev;
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	nvme_remove_stalen();
+#endif
 	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
 	if (!dev)
 		return -ENOMEM;
@@ -2582,6 +2748,9 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto free;
 
 	INIT_LIST_HEAD(&dev->namespaces);
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	INIT_LIST_HEAD(&dev->stale_node);
+#endif
 	dev->pci_dev = pdev;
 	pci_set_drvdata(pdev, dev);
 	result = nvme_set_instance(dev);
@@ -2652,6 +2821,7 @@ static void nvme_remove(struct pci_dev *pdev)
 		set_bit(NVME_HOT_REM, &dev->hp_flag);
 		dev_info(&pdev->dev,
 			"Surprise removal of device 0x%x\n", pdev->device);
+		dev->initialized = 0;
 	}
 	pci_dev_get(pdev);
 #endif
@@ -2661,7 +2831,10 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_dev_remove(dev);
 	nvme_dev_shutdown(dev);
 	nvme_free_queues(dev, 0);
-	nvme_release_instance(dev);
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!test_bit(NVME_STALE_NODE, &dev->hp_flag))
+#endif
+		nvme_release_instance(dev);
 	nvme_release_prp_pools(dev);
 	kref_put(&dev->kref, nvme_free_dev);
 #ifdef CONFIG_BLK_DEV_NVME_HP
@@ -2762,6 +2935,9 @@ static int __init nvme_init(void)
 
 static void __exit nvme_exit(void)
 {
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	nvme_remove_stalen();
+#endif
 	pci_unregister_driver(&nvme_driver);
 	unregister_blkdev(nvme_major, "nvme");
 	destroy_workqueue(nvme_workq);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 4ef375e..eb0d400 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -71,6 +71,7 @@ enum {
 #ifdef CONFIG_BLK_DEV_NVME_HP
 enum {
 	NVME_HOT_REM,
+	NVME_STALE_NODE,
 };
 #endif
 
@@ -105,6 +106,7 @@ struct nvme_dev {
 	u8 initialized;
 #ifdef CONFIG_BLK_DEV_NVME_HP
 	unsigned long hp_flag;
+	struct list_head stale_node;
 #endif
 };
 
@@ -123,6 +125,10 @@ struct nvme_ns {
 	int ms;
 	u64 mode_select_num_blocks;
 	u32 mode_select_block_len;
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	atomic_t refcount;
+#endif
+
 };
 
 /*
-- 
1.8.3.2


* [PATCH RFC 5/5] NVMe: Hotplug support during hibernate/sleep states
  2013-12-30 10:27 [PATCH RFC 0/5] NVMe: Hotplug support Santosh Y
                   ` (3 preceding siblings ...)
  2013-12-30 10:27 ` [PATCH RFC 4/5] NVMe: Stale node cleanup based on reference count Santosh Y
@ 2013-12-30 10:27 ` Santosh Y
  4 siblings, 0 replies; 23+ messages in thread
From: Santosh Y @ 2013-12-30 10:27 UTC (permalink / raw)


This patch provides NVMe hotplug support during
hibernate/sleep states.

Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
Signed-off-by: Santosh Y <santoshsy at gmail.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 48698b7..9fbc063 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -2863,6 +2863,30 @@ static int nvme_resume(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+#ifdef CONFIG_BLK_DEV_NVME_HP
+	if (!ndev)
+		return -EINVAL;
+	
+	if (nvme_check_surprise_removal(ndev)) {
+		set_bit(NVME_HOT_REM, &ndev->hp_flag);
+		pci_dev_get(pdev);
+		pci_set_drvdata(pdev, NULL);
+		flush_work(&ndev->reset_work);
+		misc_deregister(&ndev->miscdev);
+		nvme_dev_remove(ndev);
+		nvme_dev_shutdown(ndev);
+		nvme_free_queues(ndev, 0);
+
+		if (!test_bit(NVME_STALE_NODE, &ndev->hp_flag))
+			nvme_release_instance(ndev);
+		nvme_release_prp_pools(ndev);
+		kref_put(&ndev->kref, nvme_free_dev);
+		pci_dev_put(pdev);
+		nvme_remove_stalen();
+		return -ENODEV;
+	}
+#endif
+
 	if (nvme_dev_resume(ndev) && !work_busy(&ndev->reset_work)) {
 		INIT_WORK(&ndev->reset_work, nvme_reset_failed_dev);
 		queue_work(nvme_workq, &ndev->reset_work);
-- 
1.8.3.2


* [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction
  2013-12-30 10:27 ` [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction Santosh Y
@ 2013-12-30 13:32   ` Matthew Wilcox
  2013-12-30 14:52   ` Keith Busch
  1 sibling, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2013-12-30 13:32 UTC (permalink / raw)


On Mon, Dec 30, 2013 at 03:57:16PM +0530, Santosh Y wrote:
> @@ -2120,7 +2120,7 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
>  
>  struct nvme_delq_ctx {
>  	struct task_struct *waiter;
> -	struct kthread_worker* worker;
> +	struct kthread_worker *worker;
>  	atomic_t refcount;
>  };
>  

This is fine ...

> @@ -2556,10 +2556,6 @@ static void nvme_remove(struct pci_dev *pdev)
>  {
>  	struct nvme_dev *dev = pci_get_drvdata(pdev);
>  
> -	spin_lock(&dev_list_lock);
> -	list_del_init(&dev->node);
> -	spin_unlock(&dev_list_lock);
> -
>  	pci_set_drvdata(pdev, NULL);
>  	flush_work(&dev->reset_work);
>  	misc_deregister(&dev->miscdev);

I'm going to let Keith comment on this before I apply it.  There might
be a reason to delete the node from the dev_list here before we call
flush_work().


* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
@ 2013-12-30 13:46   ` Matthew Wilcox
  2013-12-30 13:48   ` Matias Bjorling
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2013-12-30 13:46 UTC (permalink / raw)


On Mon, Dec 30, 2013 at 03:57:17PM +0530, Santosh Y wrote:
> +config BLK_DEV_NVME_HP
> +	bool "Enable hotplug support"
> +	depends on BLK_DEV_NVME && HOTPLUG_PCI_PCIE
> +	default n
> +	help
> +	  If you say Y here, the driver will support hotplug feature.
> +	  Hotplug only works if the PCIe slot is hotplug capable.
> +

No.  There is no such thing as "enable hotplug support".  All devices
are at least theoretically hotpluggable, and I'm not papering over bugs
with this kind of config option.

> @@ -383,10 +403,19 @@ static void bio_completion(struct nvme_dev *dev, void *ctx,
>  		nvme_end_io_acct(bio, iod->start_time);
>  	}
>  	nvme_free_iod(dev, iod);
> -	if (status)
> -		bio_endio(bio, -EIO);
> -	else
> +	if (status) {
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if ((status & 0xff) == NVME_SC_INVALID_NS) {
> +			bio_endio(bio, -ENODEV);
> +		} else if ((status & 0xff) == NVME_SC_NS_NOT_READY) {
> +			bio->bi_rw |= REQ_FAILFAST_DRIVER;

Umm.
        __REQ_FAILFAST_DRIVER,  /* no driver retries of driver errors */
You seem to be using this to mean the exact opposite, "Do a retry".

> +			nvme_requeue_bio(dev, bio);
> +		} else
> +#endif
> +			bio_endio(bio, -EIO);
> +	} else {
>  		bio_endio(bio, 0);
> +	}
>  }
>  
>  /* length is in bytes.  gfp flags indicates whether we may sleep. */
> @@ -722,6 +751,10 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
>  	if ((bio->bi_rw & REQ_FLUSH) && !psegs)
>  		return nvme_submit_flush(nvmeq, ns, cmdid);
>  
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (bio->bi_rw & REQ_FAILFAST_DRIVER)
> +		mdelay(100);
> +#endif
>  	control = 0;
>  	if (bio->bi_rw & REQ_FUA)
>  		control |= NVME_RW_FUA;
> @@ -814,10 +847,26 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
>  
>  static void nvme_make_request(struct request_queue *q, struct bio *bio)
>  {
> -	struct nvme_ns *ns = q->queuedata;
> -	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> +	struct nvme_ns *ns = NULL;
> +	struct nvme_queue *nvmeq = NULL;
>  	int result = -EBUSY;
>  
> +	if (likely(q && q->queuedata))
> +		ns = q->queuedata;
> +	if (unlikely(!ns)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}

This confuses me.  You just checked that q->queuedata was != NULL, now
you're checking that it still wasn't NULL with no barriers or anything
between.  What are you trying to guard against here?

> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +		!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}
> +#endif
> +	nvmeq = get_nvmeq(ns->dev);
> +
>  	if (!nvmeq) {
>  		put_nvmeq(NULL);
>  		bio_endio(bio, -EIO);
> @@ -1120,6 +1169,12 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
>  			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
>  		};
>  
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &nvmeq->dev->hp_flag)) {
> +			cqe.status |= (NVME_SC_INVALID_NS << 1);
> +			info[cmdid].timeout = jiffies - NVME_IO_TIMEOUT;
> +		}
> +#endif
>  		if (timeout && !time_after(now, info[cmdid].timeout))
>  			continue;
>  		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
> @@ -1205,7 +1260,7 @@ static void nvme_disable_queue(struct nvme_dev *dev, int qid)
>  
>  	/* Don't tell the adapter to delete the admin queue.
>  	 * Don't tell a removed adapter to delete IO queues. */
> -	if (qid && readl(&dev->bar->csts) != -1) {
> +	if (qid && !nvme_check_surprise_removal(dev)) {

This isn't really "check surprise removal", it's "nvme_is_present(dev)".

>  		adapter_delete_sq(dev, qid);
>  		adapter_delete_cq(dev, qid);
>  	}
> @@ -1724,6 +1779,13 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
>  		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
>  		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
>  
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +			!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +			bio_endio(bio, -ENODEV);
> +			continue;
> +		}
> +#endif
>  		if (bio_list_empty(&nvmeq->sq_cong))
>  			remove_wait_queue(&nvmeq->sq_full,
>  							&nvmeq->sq_cong_wait);
> @@ -1746,7 +1808,8 @@ static int nvme_kthread(void *data)
>  		spin_lock(&dev_list_lock);
>  		list_for_each_entry_safe(dev, next, &dev_list, node) {
>  			int i;
> -			if (readl(&dev->bar->csts) & NVME_CSTS_CFS &&
> +			if (!nvme_check_surprise_removal(dev) &&
> +				readl(&dev->bar->csts) & NVME_CSTS_CFS &&
>  							dev->initialized) {
>  				if (work_busy(&dev->reset_work))
>  					continue;
> @@ -2082,7 +2145,7 @@ static int nvme_dev_map(struct nvme_dev *dev)
>  	dev->bar = ioremap(pci_resource_start(pdev, 0), 8192);
>  	if (!dev->bar)
>  		goto disable;
> -	if (readl(&dev->bar->csts) == -1) {
> +	if (nvme_check_surprise_removal(dev)) {
>  		result = -ENODEV;
>  		goto unmap;
>  	}
> @@ -2265,7 +2328,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
>  	list_del_init(&dev->node);
>  	spin_unlock(&dev_list_lock);
>  
> -	if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
> +	if (!dev->bar || (dev->bar && nvme_check_surprise_removal(dev))) {
>  		for (i = dev->queue_count - 1; i >= 0; i--) {
>  			struct nvme_queue *nvmeq = dev->queues[i];
>  			nvme_suspend_queue(nvmeq);
> @@ -2534,6 +2597,12 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  
>  	dev->initialized = 1;
>  	kref_init(&dev->kref);
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev->is_added)
> +		dev_info(&pdev->dev,
> +			"Device 0x%x is on-line\n", pdev->device);
> +#endif
>  	return 0;
>  
>   remove:
> @@ -2556,6 +2625,16 @@ static void nvme_remove(struct pci_dev *pdev)
>  {
>  	struct nvme_dev *dev = pci_get_drvdata(pdev);
>  
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev || !dev)
> +		return;

!pdev can't possibly happen, or we crashed on the previous line.

> +	if (nvme_check_surprise_removal(dev)) {
> +		set_bit(NVME_HOT_REM, &dev->hp_flag);
> +		dev_info(&pdev->dev,
> +			"Surprise removal of device 0x%x\n", pdev->device);
> +	}
> +	pci_dev_get(pdev);
> +#endif
>  	pci_set_drvdata(pdev, NULL);
>  	flush_work(&dev->reset_work);
>  	misc_deregister(&dev->miscdev);
> @@ -2565,6 +2644,9 @@ static void nvme_remove(struct pci_dev *pdev)
>  	nvme_release_instance(dev);
>  	nvme_release_prp_pools(dev);
>  	kref_put(&dev->kref, nvme_free_dev);
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	pci_dev_put(pdev);

I'm pretty sure you don't need to bump the refcount up and down during
this function.  Look at the caller, pci_stop_dev().  That dereferences
the pci_dev after calling ->remove.  Any bug you could possibly fix by
playing with the reference count here is only going to be hit there when
it sets is_added to 0.


* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
  2013-12-30 13:46   ` Matthew Wilcox
@ 2013-12-30 13:48   ` Matias Bjorling
  2013-12-30 14:09   ` Matias Bjorling
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Matias Bjorling @ 2013-12-30 13:48 UTC (permalink / raw)


On 12/30/2013 11:27 AM, Santosh Y wrote:
> This patch provides basic hotplug support for NVMe.
> For NVMe hotplug to work the PCIe slot must be hotplug capable.
>
> When a NVMe device is surprise removed and inserted back the
> device may need some time to respond to host IO commands, and
> will return NVME_SC_NS_NOT_READY. In this case the requests
> will be requeued until the device responds to the IO commands
> with status response or until the commands timeout.
>
> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 86b9f37..f92ec96 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -319,6 +319,14 @@ config BLK_DEV_NVME
>   	  To compile this driver as a module, choose M here: the
>   	  module will be called nvme.
>
> +config BLK_DEV_NVME_HP
> +	bool "Enable hotplug support"
> +	depends on BLK_DEV_NVME && HOTPLUG_PCI_PCIE
> +	default n
> +	help
> +	  If you say Y here, the driver will support hotplug feature.


> +	  Hotplug only works if the PCIe slot is hotplug capable.
> +
>   config BLK_DEV_SKD
>   	tristate "STEC S1120 Block Driver"
>   	depends on PCI
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index a523296..8a02135 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -172,6 +172,13 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
>   	return cmdid;
>   }
>
> +static inline bool nvme_check_surprise_removal(struct nvme_dev *dev)
> +{
> +	if (readl(&dev->bar->csts) == -1)
> +		return true;
> +	return false;
> +}
> +
>   static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
>   				nvme_completion_fn handler, unsigned timeout)
>   {
> @@ -370,6 +377,19 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
>   	part_stat_unlock();
>   }
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP

There is an excessive number of ifdef/endif blocks. Instead of
special-casing each chunk of hotplug code, what about refactoring it
into its own nvme-hotplug.c file?

This will keep the hotplug code path clean and won't interfere with
the natural flow of the nvme core code. The other option is to just
accept the complexity and not use ifdefs around the hotplug areas at all.
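
A rough sketch of that split (hypothetical file and function names,
nothing from the posted patch): declare the helpers in a header and
provide no-op stubs when the option is off, so the core code can call
them unconditionally without #ifdef blocks:

#ifdef CONFIG_BLK_DEV_NVME_HP
/* implemented in a separate nvme-hotplug.c */
bool nvme_hp_removed(struct nvme_dev *dev);
void nvme_hp_requeue_bio(struct nvme_dev *dev, struct bio *bio);
#else
static inline bool nvme_hp_removed(struct nvme_dev *dev)
{
	return false;
}
static inline void nvme_hp_requeue_bio(struct nvme_dev *dev,
					struct bio *bio)
{
}
#endif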


> +static void nvme_requeue_bio(struct nvme_dev *dev, struct bio *bio)
> +{
> +	struct nvme_queue *nvmeq = get_nvmeq(dev);
> +	if (bio_list_empty(&nvmeq->sq_cong))
> +		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
> +	bio_list_add(&nvmeq->sq_cong, bio);
> +	put_nvmeq(nvmeq);
> +	wake_up_process(nvme_thread);
> +}
> +#endif
> +
> +
>   static void bio_completion(struct nvme_dev *dev, void *ctx,
>   						struct nvme_completion *cqe)
>   {
> @@ -383,10 +403,19 @@ static void bio_completion(struct nvme_dev *dev, void *ctx,
>   		nvme_end_io_acct(bio, iod->start_time);
>   	}
>   	nvme_free_iod(dev, iod);
> -	if (status)
> -		bio_endio(bio, -EIO);
> -	else
> +	if (status) {
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if ((status & 0xff) == NVME_SC_INVALID_NS) {
> +			bio_endio(bio, -ENODEV);
> +		} else if ((status & 0xff) == NVME_SC_NS_NOT_READY) {
> +			bio->bi_rw |= REQ_FAILFAST_DRIVER;
> +			nvme_requeue_bio(dev, bio);
> +		} else
> +#endif
> +			bio_endio(bio, -EIO);
> +	} else {
>   		bio_endio(bio, 0);
> +	}
>   }
>
>   /* length is in bytes.  gfp flags indicates whether we may sleep. */
> @@ -722,6 +751,10 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
>   	if ((bio->bi_rw & REQ_FLUSH) && !psegs)
>   		return nvme_submit_flush(nvmeq, ns, cmdid);
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (bio->bi_rw & REQ_FAILFAST_DRIVER)
> +		mdelay(100);
> +#endif
>   	control = 0;
>   	if (bio->bi_rw & REQ_FUA)
>   		control |= NVME_RW_FUA;
> @@ -814,10 +847,26 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
>
>   static void nvme_make_request(struct request_queue *q, struct bio *bio)
>   {
> -	struct nvme_ns *ns = q->queuedata;
> -	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> +	struct nvme_ns *ns = NULL;
> +	struct nvme_queue *nvmeq = NULL;
>   	int result = -EBUSY;
>
> +	if (likely(q && q->queuedata))
> +		ns = q->queuedata;
> +	if (unlikely(!ns)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +		!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}
> +#endif
> +	nvmeq = get_nvmeq(ns->dev);
> +
>   	if (!nvmeq) {
>   		put_nvmeq(NULL);
>   		bio_endio(bio, -EIO);
> @@ -1120,6 +1169,12 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
>   			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
>   		};
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &nvmeq->dev->hp_flag)) {
> +			cqe.status |= (NVME_SC_INVALID_NS << 1);
> +			info[cmdid].timeout = jiffies - NVME_IO_TIMEOUT;
> +		}
> +#endif
>   		if (timeout && !time_after(now, info[cmdid].timeout))
>   			continue;
>   		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
> @@ -1205,7 +1260,7 @@ static void nvme_disable_queue(struct nvme_dev *dev, int qid)
>
>   	/* Don't tell the adapter to delete the admin queue.
>   	 * Don't tell a removed adapter to delete IO queues. */
> -	if (qid && readl(&dev->bar->csts) != -1) {
> +	if (qid && !nvme_check_surprise_removal(dev)) {
>   		adapter_delete_sq(dev, qid);
>   		adapter_delete_cq(dev, qid);
>   	}
> @@ -1724,6 +1779,13 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
>   		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
>   		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +			!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +			bio_endio(bio, -ENODEV);
> +			continue;
> +		}
> +#endif
>   		if (bio_list_empty(&nvmeq->sq_cong))
>   			remove_wait_queue(&nvmeq->sq_full,
>   							&nvmeq->sq_cong_wait);
> @@ -1746,7 +1808,8 @@ static int nvme_kthread(void *data)
>   		spin_lock(&dev_list_lock);
>   		list_for_each_entry_safe(dev, next, &dev_list, node) {
>   			int i;
> -			if (readl(&dev->bar->csts) & NVME_CSTS_CFS &&
> +			if (!nvme_check_surprise_removal(dev) &&
> +				readl(&dev->bar->csts) & NVME_CSTS_CFS &&
>   							dev->initialized) {
>   				if (work_busy(&dev->reset_work))
>   					continue;
> @@ -2082,7 +2145,7 @@ static int nvme_dev_map(struct nvme_dev *dev)
>   	dev->bar = ioremap(pci_resource_start(pdev, 0), 8192);
>   	if (!dev->bar)
>   		goto disable;
> -	if (readl(&dev->bar->csts) == -1) {
> +	if (nvme_check_surprise_removal(dev)) {
>   		result = -ENODEV;
>   		goto unmap;
>   	}
> @@ -2265,7 +2328,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
>   	list_del_init(&dev->node);
>   	spin_unlock(&dev_list_lock);
>
> -	if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
> +	if (!dev->bar || (dev->bar && nvme_check_surprise_removal(dev))) {
>   		for (i = dev->queue_count - 1; i >= 0; i--) {
>   			struct nvme_queue *nvmeq = dev->queues[i];
>   			nvme_suspend_queue(nvmeq);
> @@ -2534,6 +2597,12 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>
>   	dev->initialized = 1;
>   	kref_init(&dev->kref);
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev->is_added)
> +		dev_info(&pdev->dev,
> +			"Device 0x%x is on-line\n", pdev->device);
> +#endif
>   	return 0;
>
>    remove:
> @@ -2556,6 +2625,16 @@ static void nvme_remove(struct pci_dev *pdev)
>   {
>   	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev || !dev)
> +		return;
> +	if (nvme_check_surprise_removal(dev)) {
> +		set_bit(NVME_HOT_REM, &dev->hp_flag);
> +		dev_info(&pdev->dev,
> +			"Surprise removal of device 0x%x\n", pdev->device);
> +	}
> +	pci_dev_get(pdev);
> +#endif
>   	pci_set_drvdata(pdev, NULL);
>   	flush_work(&dev->reset_work);
>   	misc_deregister(&dev->miscdev);
> @@ -2565,6 +2644,9 @@ static void nvme_remove(struct pci_dev *pdev)
>   	nvme_release_instance(dev);
>   	nvme_release_prp_pools(dev);
>   	kref_put(&dev->kref, nvme_free_dev);
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	pci_dev_put(pdev);
> +#endif
>   }
>
>   /* These functions are yet to be implemented */
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 69ae03f..4ef375e 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -68,6 +68,12 @@ enum {
>
>   #define NVME_IO_TIMEOUT	(5 * HZ)
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +enum {
> +	NVME_HOT_REM,
> +};
> +#endif
> +
>   /*
>    * Represents an NVM Express device.  Each nvme_dev is a PCI function.
>    */
> @@ -97,6 +103,9 @@ struct nvme_dev {
>   	u16 oncs;
>   	u16 abort_limit;
>   	u8 initialized;
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	unsigned long hp_flag;
> +#endif
>   };
>
>   /*
>


* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2013-12-30 10:27 ` [PATCH RFC 3/5] NVMe: Asynchronous device scan support Santosh Y
@ 2013-12-30 13:50   ` Matthew Wilcox
  2013-12-30 15:55     ` Keith Busch
  2014-03-28 14:02     ` Santosh Y
  0 siblings, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2013-12-30 13:50 UTC (permalink / raw)


On Mon, Dec 30, 2013 at 03:57:18PM +0530, Santosh Y wrote:
> This patch provides asynchronous device enumeration
> capability. The 'probe' need not wait until the namespace scanning is
> complete.

I'm very interested in having something like this, except I don't think
it's complete.  You don't seem to handle the cases where the device
is shut down in the middle of an async scan.  Also, the only piece you
seem to make async is the calls to add_disk(), which are surely not the
time-consuming parts of the scan.  I would think the time-consuming parts
are sending the IDENTIFY commands, which you don't make async.
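
As a rough sketch of that direction (hypothetical code, not part of
the posted patch), the whole scan (IDENTIFY commands, namespace
allocation and add_disk()) could be moved into the async callback,
leaving only the race against a concurrent shutdown to be handled:

static void nvme_async_scan(void *data, async_cookie_t cookie)
{
	struct nvme_dev *dev = data;

	/* slow part: IDENTIFY, namespace setup, add_disk() */
	if (nvme_dev_add(dev) == 0)
		dev->initialized = 1;
}

with nvme_probe() calling async_schedule(nvme_async_scan, dev) instead
of invoking nvme_dev_add() inline.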

> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
> Signed-off-by: Santosh Y <santoshsy at gmail.com>
> 
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index 8a02135..cd37335 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -17,6 +17,9 @@
>   */
>  
>  #include <linux/nvme.h>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +#include <linux/async.h>
> +#endif
>  #include <linux/bio.h>
>  #include <linux/bitops.h>
>  #include <linux/blkdev.h>
> @@ -2115,8 +2118,10 @@ static int nvme_dev_add(struct nvme_dev *dev)
>  		if (ns)
>  			list_add_tail(&ns->list, &dev->namespaces);
>  	}
> +#ifndef CONFIG_BLK_DEV_NVME_HP
>  	list_for_each_entry(ns, &dev->namespaces, list)
>  		add_disk(ns->disk);
> +#endif
>  	res = 0;
>  
>   out:
> @@ -2546,6 +2551,19 @@ static void nvme_reset_failed_dev(struct work_struct *ws)
>  	nvme_dev_reset(dev);
>  }
>  
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +static void nvme_async_add(void *data, async_cookie_t cookie)
> +{
> +	struct nvme_dev *dev = (struct nvme_dev *)data;
> +	struct nvme_ns *ns;
> +
> +	list_for_each_entry(ns, &dev->namespaces, list)
> +		add_disk(ns->disk);
> +	if (!test_bit(NVME_HOT_REM, &dev->hp_flag))
> +		dev->initialized = 1;
> +}
> +#endif
> +
>  static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	int result = -ENOMEM;
> @@ -2595,14 +2613,16 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (result)
>  		goto remove;
>  
> -	dev->initialized = 1;
> -	kref_init(&dev->kref);
> -
>  #ifdef CONFIG_BLK_DEV_NVME_HP
> +	async_schedule(nvme_async_add, dev);
>  	if (!pdev->is_added)
>  		dev_info(&pdev->dev,
>  			"Device 0x%x is on-line\n", pdev->device);
> +#else
> +	dev->initialized = 1;
>  #endif
> +	kref_init(&dev->kref);
> +
>  	return 0;
>  
>   remove:
> -- 
> 1.8.3.2


* [PATCH RFC 4/5] NVMe: Stale node cleanup based on reference count
  2013-12-30 10:27 ` [PATCH RFC 4/5] NVMe: Stale node cleanup based on reference count Santosh Y
@ 2013-12-30 14:00   ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2013-12-30 14:00 UTC (permalink / raw)


On Mon, Dec 30, 2013 at 03:57:19PM +0530, Santosh Y wrote:
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +#define NVME_MINORS 64
> +#endif

Why does this need to go back for this patch?

> +static int nvme_bd_open(struct block_device *bdev, fmode_t mode)
> +{
> +	struct nvme_ns *ns;
> +	int err = -ENODEV;
> +
> +	if (!bdev || !bdev->bd_disk ||
> +		!bdev->bd_disk->private_data)
> +		goto out;

!bdev?  Do you really think the core is going to call you with a NULL bdev?
I can see that bd_disk isn't going to be NULL either.

> @@ -1805,6 +1853,9 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
>  static int nvme_kthread(void *data)
>  {
>  	struct nvme_dev *dev, *next;
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	unsigned long flags  = 0;
> +#endif

You don't need to initialise 'flags' that are used as part of an
irqsave/restore.

> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +				spin_lock_irqsave(&nvmeq->q_lock, flags);
> +#else
>  				spin_lock_irq(&nvmeq->q_lock);
> +#endif

... but *what on earth* is going on here?!  How is the kthread getting
run in interrupt context?

>  static void nvme_free_dev(struct kref *kref)
>  {
>  	struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);
>  	kfree(dev->queues);
>  	kfree(dev->entry);
> -	kfree(dev);
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!test_bit(NVME_STALE_NODE, &dev->hp_flag))
> +#endif
> +		kfree(dev);
>  }
>  
>  static int nvme_dev_open(struct inode *inode, struct file *f)
> @@ -2431,6 +2584,11 @@ static int nvme_dev_open(struct inode *inode, struct file *f)
>  static int nvme_dev_release(struct inode *inode, struct file *f)
>  {
>  	struct nvme_dev *dev = f->private_data;
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!dev)
> +		return -ENODEV;
> +#endif
>  	kref_put(&dev->kref, nvme_free_dev);
>  	return 0;
>  }

This seems like you don't understand the kref pattern.  Why do you have
this 'stale' list rather than just incrementing the kref, and delaying
the final kfree until you're done with your usage?
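
For illustration, the kref-based alternative would take a device
reference per opener and let the existing nvme_free_dev() release
callback do the final kfree() when the last user drops it (a sketch
only, not code from the patch):

static int nvme_bd_open(struct block_device *bdev, fmode_t mode)
{
	struct nvme_ns *ns = bdev->bd_disk->private_data;

	kref_get(&ns->dev->kref);	/* hold the device while open */
	return 0;
}

static void nvme_bd_release(struct gendisk *disk, fmode_t mode)
{
	struct nvme_ns *ns = disk->private_data;

	/* last put frees the device via nvme_free_dev() */
	kref_put(&ns->dev->kref, nvme_free_dev);
}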

> @@ -2438,6 +2596,11 @@ static int nvme_dev_release(struct inode *inode, struct file *f)
>  static long nvme_dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
>  {
>  	struct nvme_dev *dev = f->private_data;
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!dev)
> +		return -ENODEV;
> +#endif

How is it that f->private_data can become NULL?


* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
  2013-12-30 13:46   ` Matthew Wilcox
  2013-12-30 13:48   ` Matias Bjorling
@ 2013-12-30 14:09   ` Matias Bjorling
  2013-12-30 16:06   ` Keith Busch
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Matias Bjorling @ 2013-12-30 14:09 UTC (permalink / raw)


On 12/30/2013 11:27 AM, Santosh Y wrote:
> This patch provides basic hotplug support for NVMe.
> For NVMe hotplug to work the PCIe slot must be hotplug capable.
>
> When a NVMe device is surprise removed and inserted back the
> device may need some time to respond to host IO commands, and
> will return NVME_SC_NS_NOT_READY. In this case the requests
> will be requeued until the device responds to the IO commands
> with status response or until the commands timeout.
>
> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 86b9f37..f92ec96 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -319,6 +319,14 @@ config BLK_DEV_NVME
>   	  To compile this driver as a module, choose M here: the
>   	  module will be called nvme.
>
> +config BLK_DEV_NVME_HP
> +	bool "Enable hotplug support"
> +	depends on BLK_DEV_NVME && HOTPLUG_PCI_PCIE
> +	default n
> +	help
> +	  If you say Y here, the driver will support hotplug feature.
> +	  Hotplug only works if the PCIe slot is hotplug capable.
> +
>   config BLK_DEV_SKD
>   	tristate "STEC S1120 Block Driver"
>   	depends on PCI
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index a523296..8a02135 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -172,6 +172,13 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
>   	return cmdid;
>   }
>
> +static inline bool nvme_check_surprise_removal(struct nvme_dev *dev)
> +{
> +	if (readl(&dev->bar->csts) == -1)
> +		return true;
> +	return false;

Just return (readl(&dev->bar->csts) == -1);

> +}
> +
>   static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
>   				nvme_completion_fn handler, unsigned timeout)
>   {
> @@ -370,6 +377,19 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
>   	part_stat_unlock();
>   }
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +static void nvme_requeue_bio(struct nvme_dev *dev, struct bio *bio)
> +{
> +	struct nvme_queue *nvmeq = get_nvmeq(dev);
> +	if (bio_list_empty(&nvmeq->sq_cong))
> +		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
> +	bio_list_add(&nvmeq->sq_cong, bio);
> +	put_nvmeq(nvmeq);
> +	wake_up_process(nvme_thread);
> +}
> +#endif
> +
> +
>   static void bio_completion(struct nvme_dev *dev, void *ctx,
>   						struct nvme_completion *cqe)
>   {
> @@ -383,10 +403,19 @@ static void bio_completion(struct nvme_dev *dev, void *ctx,
>   		nvme_end_io_acct(bio, iod->start_time);
>   	}
>   	nvme_free_iod(dev, iod);
> -	if (status)
> -		bio_endio(bio, -EIO);
> -	else
> +	if (status) {
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if ((status & 0xff) == NVME_SC_INVALID_NS) {
> +			bio_endio(bio, -ENODEV);
> +		} else if ((status & 0xff) == NVME_SC_NS_NOT_READY) {
> +			bio->bi_rw |= REQ_FAILFAST_DRIVER;
> +			nvme_requeue_bio(dev, bio);
> +		} else
> +#endif
> +			bio_endio(bio, -EIO);
> +	} else {
>   		bio_endio(bio, 0);
> +	}
>   }
>
>   /* length is in bytes.  gfp flags indicates whether we may sleep. */
> @@ -722,6 +751,10 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
>   	if ((bio->bi_rw & REQ_FLUSH) && !psegs)
>   		return nvme_submit_flush(nvmeq, ns, cmdid);
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (bio->bi_rw & REQ_FAILFAST_DRIVER)
> +		mdelay(100);

Why delay 100?

> +#endif
>   	control = 0;
>   	if (bio->bi_rw & REQ_FUA)
>   		control |= NVME_RW_FUA;
> @@ -814,10 +847,26 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
>
>   static void nvme_make_request(struct request_queue *q, struct bio *bio)
>   {
> -	struct nvme_ns *ns = q->queuedata;
> -	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> +	struct nvme_ns *ns = NULL;
> +	struct nvme_queue *nvmeq = NULL;
>   	int result = -EBUSY;
>
> +	if (likely(q && q->queuedata))
> +		ns = q->queuedata;

When are q and q->queuedata invalid?

> +	if (unlikely(!ns)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +		!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}
> +#endif
> +	nvmeq = get_nvmeq(ns->dev);
> +
>   	if (!nvmeq) {
>   		put_nvmeq(NULL);
>   		bio_endio(bio, -EIO);
> @@ -1120,6 +1169,12 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
>   			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
>   		};
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &nvmeq->dev->hp_flag)) {
> +			cqe.status |= (NVME_SC_INVALID_NS << 1);
> +			info[cmdid].timeout = jiffies - NVME_IO_TIMEOUT;
> +		}
> +#endif
>   		if (timeout && !time_after(now, info[cmdid].timeout))
>   			continue;
>   		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
> @@ -1205,7 +1260,7 @@ static void nvme_disable_queue(struct nvme_dev *dev, int qid)
>
>   	/* Don't tell the adapter to delete the admin queue.
>   	 * Don't tell a removed adapter to delete IO queues. */
> -	if (qid && readl(&dev->bar->csts) != -1) {
> +	if (qid && !nvme_check_surprise_removal(dev)) {
>   		adapter_delete_sq(dev, qid);
>   		adapter_delete_cq(dev, qid);
>   	}
> @@ -1724,6 +1779,13 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
>   		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
>   		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +			!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +			bio_endio(bio, -ENODEV);
> +			continue;
> +		}
> +#endif
>   		if (bio_list_empty(&nvmeq->sq_cong))
>   			remove_wait_queue(&nvmeq->sq_full,
>   							&nvmeq->sq_cong_wait);
> @@ -1746,7 +1808,8 @@ static int nvme_kthread(void *data)
>   		spin_lock(&dev_list_lock);
>   		list_for_each_entry_safe(dev, next, &dev_list, node) {
>   			int i;
> -			if (readl(&dev->bar->csts) & NVME_CSTS_CFS &&
> +			if (!nvme_check_surprise_removal(dev) &&
> +				readl(&dev->bar->csts) & NVME_CSTS_CFS &&
>   							dev->initialized) {
>   				if (work_busy(&dev->reset_work))
>   					continue;
> @@ -2082,7 +2145,7 @@ static int nvme_dev_map(struct nvme_dev *dev)
>   	dev->bar = ioremap(pci_resource_start(pdev, 0), 8192);
>   	if (!dev->bar)
>   		goto disable;
> -	if (readl(&dev->bar->csts) == -1) {
> +	if (nvme_check_surprise_removal(dev)) {
>   		result = -ENODEV;
>   		goto unmap;
>   	}
> @@ -2265,7 +2328,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
>   	list_del_init(&dev->node);
>   	spin_unlock(&dev_list_lock);
>
> -	if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
> +	if (!dev->bar || (dev->bar && nvme_check_surprise_removal(dev))) {
>   		for (i = dev->queue_count - 1; i >= 0; i--) {
>   			struct nvme_queue *nvmeq = dev->queues[i];
>   			nvme_suspend_queue(nvmeq);
> @@ -2534,6 +2597,12 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>
>   	dev->initialized = 1;
>   	kref_init(&dev->kref);
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev->is_added)
> +		dev_info(&pdev->dev,
> +			"Device 0x%x is on-line\n", pdev->device);
> +#endif
>   	return 0;
>
>    remove:
> @@ -2556,6 +2625,16 @@ static void nvme_remove(struct pci_dev *pdev)
>   {
>   	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev || !dev)
> +		return;
> +	if (nvme_check_surprise_removal(dev)) {
> +		set_bit(NVME_HOT_REM, &dev->hp_flag);
> +		dev_info(&pdev->dev,
> +			"Surprise removal of device 0x%x\n", pdev->device);
> +	}
> +	pci_dev_get(pdev);
> +#endif
>   	pci_set_drvdata(pdev, NULL);
>   	flush_work(&dev->reset_work);
>   	misc_deregister(&dev->miscdev);
> @@ -2565,6 +2644,9 @@ static void nvme_remove(struct pci_dev *pdev)
>   	nvme_release_instance(dev);
>   	nvme_release_prp_pools(dev);
>   	kref_put(&dev->kref, nvme_free_dev);
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	pci_dev_put(pdev);
> +#endif
>   }
>
>   /* These functions are yet to be implemented */
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 69ae03f..4ef375e 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -68,6 +68,12 @@ enum {
>
>   #define NVME_IO_TIMEOUT	(5 * HZ)
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +enum {
> +	NVME_HOT_REM,
> +};

This could be put together with the enum holding 
NVME_[QUEUE,CQ,SQ,FEAT,FWACT]. Also, be explicit about the value.
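
For example, something along these lines (untested sketch):

	enum {
		/* ... existing NVME_QUEUE/CQ/SQ/FEAT/FWACT values ... */
		NVME_HOT_REM	= 0,	/* bit index used with dev->hp_flag */
	};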

> +#endif
> +
>   /*
>    * Represents an NVM Express device.  Each nvme_dev is a PCI function.
>    */
> @@ -97,6 +103,9 @@ struct nvme_dev {
>   	u16 oncs;
>   	u16 abort_limit;
>   	u8 initialized;
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	unsigned long hp_flag;

Couldn't this be moved into a generic flag int or stored within an 
existing value?

> +#endif
>   };
>
>   /*
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction
  2013-12-30 10:27 ` [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction Santosh Y
  2013-12-30 13:32   ` Matthew Wilcox
@ 2013-12-30 14:52   ` Keith Busch
  1 sibling, 0 replies; 23+ messages in thread
From: Keith Busch @ 2013-12-30 14:52 UTC (permalink / raw)


On Mon, 30 Dec 2013, Santosh Y wrote:
> Remove redunduant 'dev->node' deletion which is being
> handled in nvme_dev_shutdown() and a minor checkpatch error.
>
> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index b59a93a..a523296 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -2120,7 +2120,7 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
>
> struct nvme_delq_ctx {
> 	struct task_struct *waiter;
> -	struct kthread_worker* worker;
> +	struct kthread_worker *worker;
> 	atomic_t refcount;
> };
>
> @@ -2556,10 +2556,6 @@ static void nvme_remove(struct pci_dev *pdev)
> {
> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> -	spin_lock(&dev_list_lock);
> -	list_del_init(&dev->node);
> -	spin_unlock(&dev_list_lock);
> -
> 	pci_set_drvdata(pdev, NULL);
> 	flush_work(&dev->reset_work);
> 	misc_deregister(&dev->miscdev);

We have to delete it from the list here to prevent a surprise-removed
device from being polled after the pci layer calls the driver's
'remove'. Work cannot be queued if the device is not polled, so we have
to remove it from the list before calling flush_work, otherwise work could
potentially be queued after this point, and that's not okay.
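
To put it concretely, this is the ordering we rely on in nvme_remove()
(sketch only, with the reasoning as comments):

	spin_lock(&dev_list_lock);
	list_del_init(&dev->node);	/* poll thread can no longer see this device */
	spin_unlock(&dev_list_lock);

	pci_set_drvdata(pdev, NULL);
	flush_work(&dev->reset_work);	/* nothing can queue new reset work now */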

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2013-12-30 13:50   ` Matthew Wilcox
@ 2013-12-30 15:55     ` Keith Busch
  2014-03-28 14:02     ` Santosh Y
  1 sibling, 0 replies; 23+ messages in thread
From: Keith Busch @ 2013-12-30 15:55 UTC (permalink / raw)


On Mon, 30 Dec 2013, Matthew Wilcox wrote:
> On Mon, Dec 30, 2013 at 03:57:18PM +0530, Santosh Y wrote:
>> This patch provides asynchronous device enumeration
>> capability. The 'probe' need not wait until the namespace scanning is
>> complete.
>
> I'm very interested in having something like this, except I don't think
> it's complete.  You don't seem to handle the cases where the device
> is shut down in the middle of an async scan.  Also, the only piece you
> seem to make async is the calls to add_disk(), which are surely not the
> timeconsuming parts of the scan.  I would think the time consuming parts
> are sending the IDENTIFY commands, which you don't make async.

Surprisingly, add_disk is by far the longest part. It does several dozen
blocking reads to check for known partitions. The identifies don't
become measurable until you have several hundred namespaces, but that
time is just noise compared to 'add_disk'.

I started on a generic block patch that makes 'add_disk' non-blocking. I
can see about cleaning that up and posting to the block mailing list
for consideration.

>> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
>> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>>
>> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
>> index 8a02135..cd37335 100644
>> --- a/drivers/block/nvme-core.c
>> +++ b/drivers/block/nvme-core.c
>> @@ -17,6 +17,9 @@
>>   */
>>
>>  #include <linux/nvme.h>
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +#include <linux/async.h>
>> +#endif
>>  #include <linux/bio.h>
>>  #include <linux/bitops.h>
>>  #include <linux/blkdev.h>
>> @@ -2115,8 +2118,10 @@ static int nvme_dev_add(struct nvme_dev *dev)
>>  		if (ns)
>>  			list_add_tail(&ns->list, &dev->namespaces);
>>  	}
>> +#ifndef CONFIG_BLK_DEV_NVME_HP
>>  	list_for_each_entry(ns, &dev->namespaces, list)
>>  		add_disk(ns->disk);
>> +#endif
>>  	res = 0;
>>
>>   out:
>> @@ -2546,6 +2551,19 @@ static void nvme_reset_failed_dev(struct work_struct *ws)
>>  	nvme_dev_reset(dev);
>>  }
>>
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +static void nvme_async_add(void *data, async_cookie_t cookie)
>> +{
>> +	struct nvme_dev *dev = (struct nvme_dev *)data;
>> +	struct nvme_ns *ns;
>> +
>> +	list_for_each_entry(ns, &dev->namespaces, list)
>> +		add_disk(ns->disk);
>> +	if (!test_bit(NVME_HOT_REM, &dev->hp_flag))
>> +		dev->initialized = 1;
>> +}
>> +#endif
>> +
>>  static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>  {
>>  	int result = -ENOMEM;
>> @@ -2595,14 +2613,16 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>  	if (result)
>>  		goto remove;
>>
>> -	dev->initialized = 1;
>> -	kref_init(&dev->kref);
>> -
>>  #ifdef CONFIG_BLK_DEV_NVME_HP
>> +	async_schedule(nvme_async_add, dev);
>>  	if (!pdev->is_added)
>>  		dev_info(&pdev->dev,
>>  			"Device 0x%x is on-line\n", pdev->device);
>> +#else
>> +	dev->initialized = 1;
>>  #endif
>> +	kref_init(&dev->kref);
>> +
>>  	return 0;
>>
>>   remove:
>> --
>> 1.8.3.2
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://merlin.infradead.org/mailman/listinfo/linux-nvme
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
                     ` (2 preceding siblings ...)
  2013-12-30 14:09   ` Matias Bjorling
@ 2013-12-30 16:06   ` Keith Busch
  2013-12-30 17:21   ` Keith Busch
  2013-12-31 13:35   ` Matthew Wilcox
  5 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2013-12-30 16:06 UTC (permalink / raw)


On Mon, 30 Dec 2013, Santosh Y wrote:
> This patch provides basic hotplug support for NVMe.
> For NVMe hotplug to work the PCIe slot must be hotplug capable.
>
> When a NVMe device is surprise removed and inserted back the
> device may need some time to respond to host IO commands, and
> will return NVME_SC_NS_NOT_READY. In this case the requests
> will be requeued until the device responds to the IO commands
> with status response or until the commands timeout.
>
> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 86b9f37..f92ec96 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -319,6 +319,14 @@ config BLK_DEV_NVME
> 	  To compile this driver as a module, choose M here: the
> 	  module will be called nvme.
>
> +config BLK_DEV_NVME_HP
> +	bool "Enable hotplug support"
> +	depends on BLK_DEV_NVME && HOTPLUG_PCI_PCIE
> +	default n
> +	help
> +	  If you say Y here, the driver will support hotplug feature.
> +	  Hotplug only works if the PCIe slot is hotplug capable.
> +
> config BLK_DEV_SKD
> 	tristate "STEC S1120 Block Driver"
> 	depends on PCI
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index a523296..8a02135 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -172,6 +172,13 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
> 	return cmdid;
> }
>
> +static inline bool nvme_check_surprise_removal(struct nvme_dev *dev)
> +{
> +	if (readl(&dev->bar->csts) == -1)
> +		return true;
> +	return false;
> +}
> +
> static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
> 				nvme_completion_fn handler, unsigned timeout)
> {
> @@ -370,6 +377,19 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
> 	part_stat_unlock();
> }
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +static void nvme_requeue_bio(struct nvme_dev *dev, struct bio *bio)
> +{
> +	struct nvme_queue *nvmeq = get_nvmeq(dev);
> +	if (bio_list_empty(&nvmeq->sq_cong))
> +		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
> +	bio_list_add(&nvmeq->sq_cong, bio);
> +	put_nvmeq(nvmeq);
> +	wake_up_process(nvme_thread);
> +}
> +#endif

Unless you just so happen to be calling this function on the correct cpu,
you are modifying a bio_list with an unlocked nvme_queue, and that can
totally screw things up. I think we need to change the callbacks to take
'nvme_queue's instead of 'nvme_dev's to allow us to retry failed bios.
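
Something like the following (untested), assuming the completion callbacks
are changed to pass the nvme_queue and that the completion path already
holds nvmeq->q_lock:

	static void nvme_requeue_bio(struct nvme_queue *nvmeq, struct bio *bio)
	{
		/* caller holds nvmeq->q_lock, so touching sq_cong is safe */
		if (bio_list_empty(&nvmeq->sq_cong))
			add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
		bio_list_add(&nvmeq->sq_cong, bio);
		wake_up_process(nvme_thread);
	}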

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
                     ` (3 preceding siblings ...)
  2013-12-30 16:06   ` Keith Busch
@ 2013-12-30 17:21   ` Keith Busch
  2013-12-31  8:48     ` Ravi Kumar
  2013-12-31 13:35   ` Matthew Wilcox
  5 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2013-12-30 17:21 UTC (permalink / raw)


On Mon, 30 Dec 2013, Santosh Y wrote:
> This patch provides basic hotplug support for NVMe.
> For NVMe hotplug to work the PCIe slot must be hotplug capable.

I think the current code should handle hot-plug events just fine. I
would be very interested to know what you're doing that breaks this if
it is not working for you.

> When a NVMe device is surprise removed and inserted back the
> device may need some time to respond to host IO commands, and
> will return NVME_SC_NS_NOT_READY. In this case the requests
> will be requeued until the device responds to the IO commands
> with status response or until the commands timeout.
>

Does your device violate the spec? The driver can't send IO commands
until the controller sets CSTS.RDY to 1, and the controller should not
do that until it is ready to handle host commands.
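
For reference, the kind of wait the init path relies on looks roughly like
this (sketch only; the real code derives the timeout from CAP.TO):

	unsigned long timeout = jiffies + 30 * HZ;	/* illustrative bound */

	while (!(readl(&dev->bar->csts) & NVME_CSTS_RDY)) {
		if (time_after(jiffies, timeout))
			return -ENODEV;	/* controller never became ready */
		msleep(100);
	}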

> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 86b9f37..f92ec96 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig

[snip]

> static void nvme_make_request(struct request_queue *q, struct bio *bio)
> {
> -	struct nvme_ns *ns = q->queuedata;
> -	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> +	struct nvme_ns *ns = NULL;
> +	struct nvme_queue *nvmeq = NULL;
> 	int result = -EBUSY;
>
> +	if (likely(q && q->queuedata))
> +		ns = q->queuedata;
> +	if (unlikely(!ns)) {
> +		bio_endio(bio, -ENODEV);
> +		return;
> +	}
> +
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +		!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +		bio_endio(bio, -ENODEV);

I don't know if ENODEV is a valid error here. You can't get here unless
the device has already been opened, and it can't be opened if the device
does not exist, so I think this is an incorrect errno to use. Maybe ENXIO?

But why is this check even necessary? Let's say the device has been
hot-removed. If the platform is truly capable of handling such
events, the pci layer will call the driver's 'remove', and it will suspend
all the queues, then cancel everything. New IO submitted at this point
will get put on the bio_list and be failed shortly after.

If your platform is not capable of pcie hot-plug events, the kthread
reset handler will pick up on this since the CSTS.CFS bit will be set
and the device will be queued for reset, which will fail and trigger a
forced device removal. From there, it will look the same to the driver
as if it was called from the pci layer.

> +		return;
> +	}
> +#endif
> +	nvmeq = get_nvmeq(ns->dev);
> +
> 	if (!nvmeq) {
> 		put_nvmeq(NULL);
> 		bio_endio(bio, -EIO);
> @@ -1120,6 +1169,12 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
> 			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
> 		};
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &nvmeq->dev->hp_flag)) {
> +			cqe.status |= (NVME_SC_INVALID_NS << 1);
> +			info[cmdid].timeout = jiffies - NVME_IO_TIMEOUT;
> +		}
> +#endif

If the device is hot removed, the existing code will force cancel all
IOs, so this doesn't seem necessary. Also, OR'ing in invalid namespace
on top of abort status wouldn't be right.
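
If a distinct status really were wanted here, it would have to replace the
abort status (and go through cpu_to_le16) rather than be OR'd in, e.g.:

	cqe.status = cpu_to_le16(NVME_SC_INVALID_NS << 1);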

> 		if (timeout && !time_after(now, info[cmdid].timeout))
> 			continue;
> 		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
> @@ -1205,7 +1260,7 @@ static void nvme_disable_queue(struct nvme_dev *dev, int qid)
>
> 	/* Don't tell the adapter to delete the admin queue.
> 	 * Don't tell a removed adapter to delete IO queues. */
> -	if (qid && readl(&dev->bar->csts) != -1) {
> +	if (qid && !nvme_check_surprise_removal(dev)) {
> 		adapter_delete_sq(dev, qid);
> 		adapter_delete_cq(dev, qid);
> 	}
> @@ -1724,6 +1779,13 @@ static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
> 		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
> 		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +		if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
> +			!(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
> +			bio_endio(bio, -ENODEV);
> +			continue;
> +		}
> +#endif

If your device is hot removed, it won't be in the nvme dev_list so it
can't be polled to resubmit bios, right?

GENHD_FL_UP won't be unset either until we call del_gendisk, and that
doesn't happen while the device is polled, so this would be dead code.

... but ...

> @@ -2556,6 +2625,16 @@ static void nvme_remove(struct pci_dev *pdev)
> {
> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	if (!pdev || !dev)
> +		return;
> +	if (nvme_check_surprise_removal(dev)) {
> +		set_bit(NVME_HOT_REM, &dev->hp_flag);
> +		dev_info(&pdev->dev,
> +			"Surprise removal of device 0x%x\n", pdev->device);
> +	}
> +	pci_dev_get(pdev);
> +#endif

Aha! The above wouldn't be necessary if you didn't remove the list_del
operation from your previous patch.

> 	pci_set_drvdata(pdev, NULL);
> 	flush_work(&dev->reset_work);
> 	misc_deregister(&dev->miscdev);
> @@ -2565,6 +2644,9 @@ static void nvme_remove(struct pci_dev *pdev)
> 	nvme_release_instance(dev);
> 	nvme_release_prp_pools(dev);
> 	kref_put(&dev->kref, nvme_free_dev);
> +#ifdef CONFIG_BLK_DEV_NVME_HP
> +	pci_dev_put(pdev);
> +#endif
> }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 17:21   ` Keith Busch
@ 2013-12-31  8:48     ` Ravi Kumar
  0 siblings, 0 replies; 23+ messages in thread
From: Ravi Kumar @ 2013-12-31  8:48 UTC (permalink / raw)


 Keith,

>> When a NVMe device is surprise removed and inserted back the
>> device may need some time to respond to host IO commands, and
>> will return NVME_SC_NS_NOT_READY. In this case the requests
>> will be requeued until the device responds to the IO commands
>> with status response or until the commands timeout.


> Does your device violate the spec? The driver can't send IO commands
> until the controller sets CSTS.RDY to 1, and the controller should not
> do that until it is ready to handle host commands.

Does the spec mandate that CSTS.RDY be set only after the namespaces are
ready? I can see the case where a device has multiple namespaces and only
a few of them are ready to process commands while the rest are not. Don't
we have to handle that as well?

I have checked the spec: CSTS.RDY can be set by the device without the
namespaces being ready.

Regards,
Ravi

On Mon, Dec 30, 2013 at 10:51 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, 30 Dec 2013, Santosh Y wrote:
>>
>> This patch provides basic hotplug support for NVMe.
>> For NVMe hotplug to work the PCIe slot must be hotplug capable.
>
>
> I think the current code should handle hot-plug events just fine. I
> would be very interested to know what you're doing that breaks this if
> it is not working for you.
>
>
>> When a NVMe device is surprise removed and inserted back the
>> device may need some time to respond to host IO commands, and
>> will return NVME_SC_NS_NOT_READY. In this case the requests
>> will be requeued until the device responds to the IO commands
>> with status response or until the commands timeout.
>>
>
> Does your device violate the spec? The driver can't send IO commands
> until the controller sets CSTS.RDY to 1, and the controller should not
> do that until it is ready to handle host commands.
>
>
>> Signed-off-by: Ravi Kumar <ravi.android at gmail.com>
>> Signed-off-by: Santosh Y <santoshsy at gmail.com>
>>
>> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
>> index 86b9f37..f92ec96 100644
>> --- a/drivers/block/Kconfig
>> +++ b/drivers/block/Kconfig
>
>
> [snip]
>
>
>> static void nvme_make_request(struct request_queue *q, struct bio *bio)
>> {
>> -       struct nvme_ns *ns = q->queuedata;
>> -       struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
>> +       struct nvme_ns *ns = NULL;
>> +       struct nvme_queue *nvmeq = NULL;
>>         int result = -EBUSY;
>>
>> +       if (likely(q && q->queuedata))
>> +               ns = q->queuedata;
>> +       if (unlikely(!ns)) {
>> +               bio_endio(bio, -ENODEV);
>> +               return;
>> +       }
>> +
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +       if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
>> +               !(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
>> +               bio_endio(bio, -ENODEV);
>
>
> I don't know if ENODEV is a valid error here. You can't get here unless
> the device has already been opened, and it can't be opened if the device
> does not exist, so I think this is an incorrect errno to use. Maybe ENXIO?
>
> But why is this check even necessary? Let's say the device has been
> hot-removed. If the platform is truly capable of handling such
> events, the pci layer will call the driver's 'remove', and it will suspend
> all the queues, then cancel everything. New IO submitted at this point
> will get put on the bio_list and be failed shortly after.
>
> If your platform is not capable of pcie hot-plug events, the kthread
> reset handler will pick up on this since the CSTS.CFS bit will be set
> and the device will be queued for reset, which will fail and trigger a
> forced device removal. From there, it will look the same to the driver
> as if it was called from the pci layer.
>
>
>> +               return;
>> +       }
>> +#endif
>> +       nvmeq = get_nvmeq(ns->dev);
>> +
>>         if (!nvmeq) {
>>                 put_nvmeq(NULL);
>>                 bio_endio(bio, -EIO);
>> @@ -1120,6 +1169,12 @@ static void nvme_cancel_ios(struct nvme_queue
>> *nvmeq, bool timeout)
>>                         .status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
>>                 };
>>
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +               if (test_bit(NVME_HOT_REM, &nvmeq->dev->hp_flag)) {
>> +                       cqe.status |= (NVME_SC_INVALID_NS << 1);
>> +                       info[cmdid].timeout = jiffies - NVME_IO_TIMEOUT;
>> +               }
>> +#endif
>
>
> If the device is hot removed, the existing code will force cancel all
> IOs, so this doesn't seem necessary. Also, OR'ing in invalid namespace
> on top of abort status wouldn't be right.
>
>
>>                 if (timeout && !time_after(now, info[cmdid].timeout))
>>                         continue;
>>                 if (info[cmdid].ctx == CMD_CTX_CANCELLED)
>> @@ -1205,7 +1260,7 @@ static void nvme_disable_queue(struct nvme_dev *dev,
>> int qid)
>>
>>         /* Don't tell the adapter to delete the admin queue.
>>          * Don't tell a removed adapter to delete IO queues. */
>> -       if (qid && readl(&dev->bar->csts) != -1) {
>> +       if (qid && !nvme_check_surprise_removal(dev)) {
>>                 adapter_delete_sq(dev, qid);
>>                 adapter_delete_cq(dev, qid);
>>         }
>> @@ -1724,6 +1779,13 @@ static void nvme_resubmit_bios(struct nvme_queue
>> *nvmeq)
>>                 struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
>>                 struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
>>
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +               if (test_bit(NVME_HOT_REM, &ns->dev->hp_flag) ||
>> +                       !(bio->bi_bdev->bd_disk->flags & GENHD_FL_UP)) {
>> +                       bio_endio(bio, -ENODEV);
>> +                       continue;
>> +               }
>> +#endif
>
>
> If your device is hot removed, it won't be in the nvme dev_list so it
> can't be polled to resubmit bios, right?
>
> GENHD_FL_UP won't be unset either until we call del_gendisk, and that
> doesn't happen while the device is polled, so this would be dead code.
>
> ... but ...
>
>
>> @@ -2556,6 +2625,16 @@ static void nvme_remove(struct pci_dev *pdev)
>> {
>>         struct nvme_dev *dev = pci_get_drvdata(pdev);
>>
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +       if (!pdev || !dev)
>> +               return;
>> +       if (nvme_check_surprise_removal(dev)) {
>> +               set_bit(NVME_HOT_REM, &dev->hp_flag);
>> +               dev_info(&pdev->dev,
>> +                       "Surprise removal of device 0x%x\n",
>> pdev->device);
>> +       }
>> +       pci_dev_get(pdev);
>> +#endif
>
>
> Aha! The above wouldn't be necessary if you didn't remove the list_del
> operation from your previous patch.
>
>
>>         pci_set_drvdata(pdev, NULL);
>>         flush_work(&dev->reset_work);
>>         misc_deregister(&dev->miscdev);
>> @@ -2565,6 +2644,9 @@ static void nvme_remove(struct pci_dev *pdev)
>>         nvme_release_instance(dev);
>>         nvme_release_prp_pools(dev);
>>         kref_put(&dev->kref, nvme_free_dev);
>> +#ifdef CONFIG_BLK_DEV_NVME_HP
>> +       pci_dev_put(pdev);
>> +#endif
>> }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
                     ` (4 preceding siblings ...)
  2013-12-30 17:21   ` Keith Busch
@ 2013-12-31 13:35   ` Matthew Wilcox
  2013-12-31 17:17     ` Matthew Wilcox
  5 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2013-12-31 13:35 UTC (permalink / raw)


On Mon, Dec 30, 2013 at 03:57:17PM +0530, Santosh Y wrote:
> +		if ((status & 0xff) == NVME_SC_INVALID_NS) {
> +			bio_endio(bio, -ENODEV);
> +		} else if ((status & 0xff) == NVME_SC_NS_NOT_READY) {
> +			bio->bi_rw |= REQ_FAILFAST_DRIVER;
> +			nvme_requeue_bio(dev, bio);
> +		} else

BTW, ANDing with 0xff is wrong.  The NVME_SC_NS_NOT_READY test would also
catch NVME_SC_GUARD_CHECK (value 0x282).  We should include:

	NVME_SC_MASK			= 0xbff

as part of the enum, or perhaps have a macro:

#define nvme_status_code(status)	((status) & 0xbff)
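
The check quoted above would then read something like (untested, just to
illustrate the macro):

	if (nvme_status_code(status) == NVME_SC_NS_NOT_READY) {
		bio->bi_rw |= REQ_FAILFAST_DRIVER;
		nvme_requeue_bio(dev, bio);
	}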


BTW, I'm going to update the status codes with the ones found in spec 1.1a:


diff --git a/include/uapi/linux/nvme.h b/include/uapi/linux/nvme.h
index f009c15..ca4a9fa 100644
--- a/include/uapi/linux/nvme.h
+++ b/include/uapi/linux/nvme.h
@@ -412,9 +412,15 @@ enum {
 	NVME_SC_FUSED_MISSING		= 0xa,
 	NVME_SC_INVALID_NS		= 0xb,
 	NVME_SC_CMD_SEQ_ERROR		= 0xc,
+	NVME_SC_SGL_LAST_SEG		= 0xd,
+	NVME_SC_SGL_COUNT		= 0xe,
+	NVME_SC_SGL_DATA_LEN		= 0xf,
+	NVME_SC_SGL_METADATA_LEN	= 0x10,
+	NVME_SC_SGL_TYPE		= 0x11,
 	NVME_SC_LBA_RANGE		= 0x80,
 	NVME_SC_CAP_EXCEEDED		= 0x81,
 	NVME_SC_NS_NOT_READY		= 0x82,
+	NVME_SC_RESERVATION_CONFLICT	= 0x83,
 	NVME_SC_CQ_INVALID		= 0x100,
 	NVME_SC_QID_INVALID		= 0x101,
 	NVME_SC_QUEUE_SIZE		= 0x102,
@@ -426,7 +432,15 @@ enum {
 	NVME_SC_INVALID_VECTOR		= 0x108,
 	NVME_SC_INVALID_LOG_PAGE	= 0x109,
 	NVME_SC_INVALID_FORMAT		= 0x10a,
+	NVME_SC_CONVENTIONAL_RESET	= 0x10b,
+	NVME_SC_INVALID_QUEUE		= 0x10c,
+	NVME_SC_FEAT_NOT_SAVABLE	= 0x10d,
+	NVME_SC_FEAT_NOT_CHANGABLE	= 0x10e,
+	NVME_SC_FEAT_NOT_NS		= 0x10f,
+	NVME_SC_SUBSYSTEM_RESET		= 0x110,
 	NVME_SC_BAD_ATTRIBUTES		= 0x180,
+	NVME_SC_INVALID_PI		= 0x181,
+	NVME_SC_RANGE_RO		= 0x182,
 	NVME_SC_WRITE_FAULT		= 0x280,
 	NVME_SC_READ_ERROR		= 0x281,
 	NVME_SC_GUARD_CHECK		= 0x282,

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support
  2013-12-31 13:35   ` Matthew Wilcox
@ 2013-12-31 17:17     ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2013-12-31 17:17 UTC (permalink / raw)


On Tue, Dec 31, 2013 at 08:35:40AM -0500, Matthew Wilcox wrote:
> 	NVME_SC_MASK			= 0xbff

uhh .. 7ff.  Don't know what I was thinking.

Oh, and then we should convert the code in nvme-scsi.c to use the new
macro or constant instead of the bare 7ff that it currently uses.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2013-12-30 13:50   ` Matthew Wilcox
  2013-12-30 15:55     ` Keith Busch
@ 2014-03-28 14:02     ` Santosh Y
  2014-03-28 16:29       ` Keith Busch
  2014-04-11 13:59       ` Matthew Wilcox
  1 sibling, 2 replies; 23+ messages in thread
From: Santosh Y @ 2014-03-28 14:02 UTC (permalink / raw)


On Mon, Dec 30, 2013 at 7:20 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Dec 30, 2013 at 03:57:18PM +0530, Santosh Y wrote:
>> This patch provides asynchronous device enumeration
>> capability. The 'probe' need not wait until the namespace scanning is
>> complete.
>
> I'm very interested in having something like this, except I don't think
> it's complete.  You don't seem to handle the cases where the device
> is shut down in the middle of an async scan.  Also, the only piece you
> seem to make async is the calls to add_disk(), which are surely not the
> timeconsuming parts of the scan.  I would think the time consuming parts
> are sending the IDENTIFY commands, which you don't make async.
>

Would you prefer changes in the following patch? Please let me know
your comments.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b59a93a..2f65f89 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -17,6 +17,7 @@
  */

 #include <linux/nvme.h>
+#include <linux/async.h>
 #include <linux/bio.h>
 #include <linux/bitops.h>
 #include <linux/blkdev.h>
@@ -1993,6 +1994,49 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
     return result;
 }

+static void nvme_async_add(void *data, async_cookie_t cookie)
+{
+    struct nvme_dev *dev = (struct nvme_dev *) data;
+    struct pci_dev *pdev = dev->pci_dev;
+    struct nvme_ns *ns;
+    struct nvme_id_ns *id_ns;
+    void *mem;
+    dma_addr_t dma_addr;
+    unsigned i;
+    int res;
+
+    mem = dma_alloc_coherent(&pdev->dev, 8192, &dma_addr, GFP_KERNEL);
+    if (!mem)
+        return;
+
+    id_ns = mem;
+    for (i = 1; i <= dev->nn; i++) {
+        res = nvme_identify(dev, i, 0, dma_addr);
+
+        if (res)
+            continue;
+
+        if (id_ns->ncap == 0)
+            continue;
+
+        res = nvme_get_features(dev, NVME_FEAT_LBA_RANGE, i,
+                    dma_addr + 4096, NULL);
+        if (res)
+            memset(mem + 4096, 0, 4096);
+        ns = nvme_alloc_ns(dev, i, mem, mem + 4096);
+        if (ns)
+            list_add_tail(&ns->list, &dev->namespaces);
+    }
+
+    mutex_lock(&dev->async_mutex);
+    list_for_each_entry(ns, &dev->namespaces, list)
+        add_disk(ns->disk);
+    mutex_unlock(&dev->async_mutex);
+
+    dev->initialized = 1;
+    dma_free_coherent(&pdev->dev, 8192, mem, dma_addr);
+}
+
 /*
  * Return: error value if an error occurred setting up the queues or calling
  * Identify Device.  0 if these succeeded, even if adding some of the
@@ -2003,15 +2047,12 @@ static int nvme_dev_add(struct nvme_dev *dev)
 {
     struct pci_dev *pdev = dev->pci_dev;
     int res;
-    unsigned nn, i;
-    struct nvme_ns *ns;
     struct nvme_id_ctrl *ctrl;
-    struct nvme_id_ns *id_ns;
     void *mem;
     dma_addr_t dma_addr;
     int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;

-    mem = dma_alloc_coherent(&pdev->dev, 8192, &dma_addr, GFP_KERNEL);
+    mem = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
     if (!mem)
         return -ENOMEM;

@@ -2022,7 +2063,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
     }

     ctrl = mem;
-    nn = le32_to_cpup(&ctrl->nn);
+    dev->nn = le32_to_cpup(&ctrl->nn);
     dev->oncs = le16_to_cpup(&ctrl->oncs);
     dev->abort_limit = ctrl->acl + 1;
     memcpy(dev->serial, ctrl->sn, sizeof(ctrl->sn));
@@ -2034,30 +2075,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
             (pdev->device == 0x0953) && ctrl->vs[3])
         dev->stripe_size = 1 << (ctrl->vs[3] + shift);

-    id_ns = mem;
-    for (i = 1; i <= nn; i++) {
-        res = nvme_identify(dev, i, 0, dma_addr);
-        if (res)
-            continue;
-
-        if (id_ns->ncap == 0)
-            continue;
-
-        res = nvme_get_features(dev, NVME_FEAT_LBA_RANGE, i,
-                            dma_addr + 4096, NULL);
-        if (res)
-            memset(mem + 4096, 0, 4096);
-
-        ns = nvme_alloc_ns(dev, i, mem, mem + 4096);
-        if (ns)
-            list_add_tail(&ns->list, &dev->namespaces);
-    }
-    list_for_each_entry(ns, &dev->namespaces, list)
-        add_disk(ns->disk);
+    async_schedule(nvme_async_add, dev);
     res = 0;

  out:
-    dma_free_coherent(&dev->pci_dev->dev, 8192, mem, dma_addr);
+    dma_free_coherent(&pdev->dev, 4096, mem, dma_addr);
     return res;
 }

@@ -2501,6 +2523,7 @@ static int nvme_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
         goto free;

     INIT_LIST_HEAD(&dev->namespaces);
+    mutex_init(&dev->async_mutex);
     dev->pci_dev = pdev;
     pci_set_drvdata(pdev, dev);
     result = nvme_set_instance(dev);
@@ -2532,7 +2555,6 @@ static int nvme_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
     if (result)
         goto remove;

-    dev->initialized = 1;
     kref_init(&dev->kref);
     return 0;

@@ -2560,6 +2582,7 @@ static void nvme_remove(struct pci_dev *pdev)
     list_del_init(&dev->node);
     spin_unlock(&dev_list_lock);

+    mutex_lock(&dev->async_mutex);
     pci_set_drvdata(pdev, NULL);
     flush_work(&dev->reset_work);
     misc_deregister(&dev->miscdev);
@@ -2569,6 +2592,7 @@ static void nvme_remove(struct pci_dev *pdev)
     nvme_release_instance(dev);
     nvme_release_prp_pools(dev);
     kref_put(&dev->kref, nvme_free_dev);
+    mutex_unlock(&dev->async_mutex);
 }

 /* These functions are yet to be implemented */
@@ -2583,7 +2607,9 @@ static int nvme_suspend(struct device *dev)
     struct pci_dev *pdev = to_pci_dev(dev);
     struct nvme_dev *ndev = pci_get_drvdata(pdev);

+    mutex_lock(&ndev->async_mutex);
     nvme_dev_shutdown(ndev);
+    mutex_unlock(&ndev->async_mutex);
     return 0;
 }

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 69ae03f..4ada390 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -97,6 +97,8 @@ struct nvme_dev {
     u16 oncs;
     u16 abort_limit;
     u8 initialized;
+    unsigned nn;
+    struct mutex async_mutex;
 };

 /*
-- 
1.8.3.2


-- 
~Santosh

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2014-03-28 14:02     ` Santosh Y
@ 2014-03-28 16:29       ` Keith Busch
  2014-04-11 13:59       ` Matthew Wilcox
  1 sibling, 0 replies; 23+ messages in thread
From: Keith Busch @ 2014-03-28 16:29 UTC (permalink / raw)


On Fri, 28 Mar 2014, Santosh Y wrote:
> On Mon, Dec 30, 2013 at 7:20 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
>> On Mon, Dec 30, 2013 at 03:57:18PM +0530, Santosh Y wrote:
>>> This patch provides asynchronous device enumeration
>>> capability. The 'probe' need not wait until the namespace scanning is
>>> complete.
>>
>> I'm very interested in having something like this, except I don't think
>> it's complete.  You don't seem to handle the cases where the device
>> is shut down in the middle of an async scan.  Also, the only piece you
>> seem to make async is the calls to add_disk(), which are surely not the
>> timeconsuming parts of the scan.  I would think the time consuming parts
>> are sending the IDENTIFY commands, which you don't make async.
>>
>
> Would you prefer changes in the following patch? Please let me know
> your comments.

This still doesn't look safe in the case where a remove occurs during an
async scan. I think you need to do something like save the async_cookie_t
when you start the scan and call async_synchronize_cookie() in the removal
path.
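
Something along these lines, as an untested sketch (the 'scan_cookie' field
is made up):

	/* in struct nvme_dev */
	async_cookie_t scan_cookie;

	/* where the scan is kicked off */
	dev->scan_cookie = async_schedule(nvme_async_add, dev);

	/* in the remove/shutdown path, before tearing the device down;
	 * async_synchronize_cookie() waits for async calls scheduled before
	 * the given cookie, hence the +1 so the scan itself is included.
	 */
	async_synchronize_cookie(dev->scan_cookie + 1);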

> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index b59a93a..2f65f89 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -17,6 +17,7 @@
>  */
>
> #include <linux/nvme.h>
> +#include <linux/async.h>
> #include <linux/bio.h>
> #include <linux/bitops.h>
> #include <linux/blkdev.h>
> @@ -1993,6 +1994,49 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>     return result;
> }
>
> +static void nvme_async_add(void *data, async_cookie_t cookie)
> +{
> +    struct nvme_dev *dev = (struct nvme_dev *) data;
> +    struct pci_dev *pdev = dev->pci_dev;
> +    struct nvme_ns *ns;
> +    struct nvme_id_ns *id_ns;
> +    void *mem;
> +    dma_addr_t dma_addr;
> +    unsigned i;
> +    int res;
> +
> +    mem = dma_alloc_coherent(&pdev->dev, 8192, &dma_addr, GFP_KERNEL);
> +    if (!mem)
> +        return;
> +
> +    id_ns = mem;
> +    for (i = 1; i <= dev->nn; i++) {
> +        res = nvme_identify(dev, i, 0, dma_addr);
> +
> +        if (res)
> +            continue;
> +
> +        if (id_ns->ncap == 0)
> +            continue;
> +
> +        res = nvme_get_features(dev, NVME_FEAT_LBA_RANGE, i,
> +                    dma_addr + 4096, NULL);
> +        if (res)
> +            memset(mem + 4096, 0, 4096);
> +        ns = nvme_alloc_ns(dev, i, mem, mem + 4096);
> +        if (ns)
> +            list_add_tail(&ns->list, &dev->namespaces);
> +    }
> +
> +    mutex_lock(&dev->async_mutex);
> +    list_for_each_entry(ns, &dev->namespaces, list)
> +        add_disk(ns->disk);
> +    mutex_unlock(&dev->async_mutex);
> +
> +    dev->initialized = 1;
> +    dma_free_coherent(&pdev->dev, 8192, mem, dma_addr);
> +}
> +
> /*
>  * Return: error value if an error occurred setting up the queues or calling
>  * Identify Device.  0 if these succeeded, even if adding some of the
> @@ -2003,15 +2047,12 @@ static int nvme_dev_add(struct nvme_dev *dev)
> {
>     struct pci_dev *pdev = dev->pci_dev;
>     int res;
> -    unsigned nn, i;
> -    struct nvme_ns *ns;
>     struct nvme_id_ctrl *ctrl;
> -    struct nvme_id_ns *id_ns;
>     void *mem;
>     dma_addr_t dma_addr;
>     int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
>
> -    mem = dma_alloc_coherent(&pdev->dev, 8192, &dma_addr, GFP_KERNEL);
> +    mem = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
>     if (!mem)
>         return -ENOMEM;
>
> @@ -2022,7 +2063,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
>     }
>
>     ctrl = mem;
> -    nn = le32_to_cpup(&ctrl->nn);
> +    dev->nn = le32_to_cpup(&ctrl->nn);
>     dev->oncs = le16_to_cpup(&ctrl->oncs);
>     dev->abort_limit = ctrl->acl + 1;
>     memcpy(dev->serial, ctrl->sn, sizeof(ctrl->sn));
> @@ -2034,30 +2075,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
>             (pdev->device == 0x0953) && ctrl->vs[3])
>         dev->stripe_size = 1 << (ctrl->vs[3] + shift);
>
> -    id_ns = mem;
> -    for (i = 1; i <= nn; i++) {
> -        res = nvme_identify(dev, i, 0, dma_addr);
> -        if (res)
> -            continue;
> -
> -        if (id_ns->ncap == 0)
> -            continue;
> -
> -        res = nvme_get_features(dev, NVME_FEAT_LBA_RANGE, i,
> -                            dma_addr + 4096, NULL);
> -        if (res)
> -            memset(mem + 4096, 0, 4096);
> -
> -        ns = nvme_alloc_ns(dev, i, mem, mem + 4096);
> -        if (ns)
> -            list_add_tail(&ns->list, &dev->namespaces);
> -    }
> -    list_for_each_entry(ns, &dev->namespaces, list)
> -        add_disk(ns->disk);
> +    async_schedule(nvme_async_add, dev);
>     res = 0;
>
>  out:
> -    dma_free_coherent(&dev->pci_dev->dev, 8192, mem, dma_addr);
> +    dma_free_coherent(&pdev->dev, 4096, mem, dma_addr);
>     return res;
> }
>
> @@ -2501,6 +2523,7 @@ static int nvme_probe(struct pci_dev *pdev,
> const struct pci_device_id *id)
>         goto free;
>
>     INIT_LIST_HEAD(&dev->namespaces);
> +    mutex_init(&dev->async_mutex);
>     dev->pci_dev = pdev;
>     pci_set_drvdata(pdev, dev);
>     result = nvme_set_instance(dev);
> @@ -2532,7 +2555,6 @@ static int nvme_probe(struct pci_dev *pdev,
> const struct pci_device_id *id)
>     if (result)
>         goto remove;
>
> -    dev->initialized = 1;
>     kref_init(&dev->kref);
>     return 0;
>
> @@ -2560,6 +2582,7 @@ static void nvme_remove(struct pci_dev *pdev)
>     list_del_init(&dev->node);
>     spin_unlock(&dev_list_lock);
>
> +    mutex_lock(&dev->async_mutex);
>     pci_set_drvdata(pdev, NULL);
>     flush_work(&dev->reset_work);
>     misc_deregister(&dev->miscdev);
> @@ -2569,6 +2592,7 @@ static void nvme_remove(struct pci_dev *pdev)
>     nvme_release_instance(dev);
>     nvme_release_prp_pools(dev);
>     kref_put(&dev->kref, nvme_free_dev);
> +    mutex_unlock(&dev->async_mutex);
> }
>
> /* These functions are yet to be implemented */
> @@ -2583,7 +2607,9 @@ static int nvme_suspend(struct device *dev)
>     struct pci_dev *pdev = to_pci_dev(dev);
>     struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +    mutex_lock(&ndev->async_mutex);
>     nvme_dev_shutdown(ndev);
> +    mutex_unlock(&ndev->async_mutex);
>     return 0;
> }
>
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 69ae03f..4ada390 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -97,6 +97,8 @@ struct nvme_dev {
>     u16 oncs;
>     u16 abort_limit;
>     u8 initialized;
> +    unsigned nn;
> +    struct mutex async_mutex;
> };

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2014-03-28 14:02     ` Santosh Y
  2014-03-28 16:29       ` Keith Busch
@ 2014-04-11 13:59       ` Matthew Wilcox
  2014-04-13 17:42         ` Matthew Wilcox
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2014-04-11 13:59 UTC (permalink / raw)


On Fri, Mar 28, 2014 at 07:32:54PM +0530, Santosh Y wrote:
> On Mon, Dec 30, 2013 at 7:20 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > On Mon, Dec 30, 2013 at 03:57:18PM +0530, Santosh Y wrote:
> >> This patch provides asynchronous device enumeration
> >> capability. The 'probe' need not wait until the namespace scanning is
> >> complete.
> >
> > I'm very interested in having something like this, except I don't think
> > it's complete.  You don't seem to handle the cases where the device
> > is shut down in the middle of an async scan.  Also, the only piece you
> > seem to make async is the calls to add_disk(), which are surely not the
> > timeconsuming parts of the scan.  I would think the time consuming parts
> > are sending the IDENTIFY commands, which you don't make async.
> >
> 
> Would you prefer changes in the following patch? Please let me know
> your comments.

There are a few things I'd like to see done differently.

For motivation, I have a device here which erroneously reports that it
has billions of namespaces.  Every time I forget to include the patch to
change 'nn' to 1 for that PCI device ID, the driver sends billions of
IDENTIFY NAMESPACE commands to that device.  While I wait for that to
happen, it gives me plenty of time to look at the system and think about
the consequences of the code we have today :-)

1. I want to be able to remove the module before the scan has finished.
So the 'wait for the mutex' design you have doesn't work for this case;
we need to be able to interrupt the scan.

2. Right now all of the work is being carried out by a kworker.  Since we
have an nvme kthread, I'd like for that to be doing the work so someone
who doesn't know nvme has a decent chance of figuring out that their
system is misbehaving because nvme is doing something unusual.

3. The system is spewing out lots of nasty messages about tasks hanging
waiting for mutexes, eg:

[  721.687298] INFO: task modprobe:1072 blocked for more than 120 seconds.

The mutex it is waiting for is being held by the upper level code which
has called the nvme driver to initialise.  Therefore we must return from
the nvme driver probe code before scanning namespaces has completed.

4. There is an accepted TP for nvme 1.2 that allows the device to send
an Async Notification that some aspect of the namespace inventory has
changed, and invites us to rescan.  I would love to take that into
account in a patch that overhauls scanning.

5. A really good patch would also try to use the nvme 1.1 capability to
report sparse namespaces (IDENTIFY with CNS = 2).

6. You need to find a better email program; the one you have is converting
tabs to spaces.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/5] NVMe: Asynchronous device scan support
  2014-04-11 13:59       ` Matthew Wilcox
@ 2014-04-13 17:42         ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2014-04-13 17:42 UTC (permalink / raw)


On Fri, Apr 11, 2014 at 09:59:01AM -0400, Matthew Wilcox wrote:
> For motivation, I have a device here which erroneously reports that it
> has billions of namespaces.  Every time I forget to include the patch to
> change 'nn' to 1 for that PCI device ID, the driver sends billions of
> IDENTIFY NAMESPACE commands to that device.  While I wait for that to
> happen, it gives me plenty of time to look at the system and think about
> the consequences of the code we have today :-)

By the way, this patch looks like a good start on tackling the problem:

http://lists.infradead.org/pipermail/linux-nvme/2013-September/000470.html

Some of the infrastructure it adds is already present, eg
nvme_submit_admin_cmd_async().  Allocating one discovery event per
namespace seems like a bad idea; since we can't have more than 64 admin
commands in flight, it would seem reasonable to limit the number of
disco events that can be outstanding at any time.  With Keith's example
2 million namespaces, it was allocating 16GB of DMA consistent memory
... not too much of a problem on a typical dev box these days, but I
wouldn't like to be a NAS box.

P.S.: I am, though, greatly in favour of a free disco event.  Perhaps we
can arrange one at a conference soon?

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-04-13 17:42 UTC | newest]

Thread overview: 23+ messages
2013-12-30 10:27 [PATCH RFC 0/5] NVMe: Hotplug support Santosh Y
2013-12-30 10:27 ` [PATCH RFC 1/5] NVMe: Code cleanup and minor checkpatch correction Santosh Y
2013-12-30 13:32   ` Matthew Wilcox
2013-12-30 14:52   ` Keith Busch
2013-12-30 10:27 ` [PATCH RFC 2/5] NVMe: Basic NVMe device hotplug support Santosh Y
2013-12-30 13:46   ` Matthew Wilcox
2013-12-30 13:48   ` Matias Bjorling
2013-12-30 14:09   ` Matias Bjorling
2013-12-30 16:06   ` Keith Busch
2013-12-30 17:21   ` Keith Busch
2013-12-31  8:48     ` Ravi Kumar
2013-12-31 13:35   ` Matthew Wilcox
2013-12-31 17:17     ` Matthew Wilcox
2013-12-30 10:27 ` [PATCH RFC 3/5] NVMe: Asynchronous device scan support Santosh Y
2013-12-30 13:50   ` Matthew Wilcox
2013-12-30 15:55     ` Keith Busch
2014-03-28 14:02     ` Santosh Y
2014-03-28 16:29       ` Keith Busch
2014-04-11 13:59       ` Matthew Wilcox
2014-04-13 17:42         ` Matthew Wilcox
2013-12-30 10:27 ` [PATCH RFC 4/5] NVMe: Stale node cleanup based on reference count Santosh Y
2013-12-30 14:00   ` Matthew Wilcox
2013-12-30 10:27 ` [PATCH RFC 5/5] NVMe: Hotplug support during hibernate/sleep states Santosh Y
