Linux block layer


* [GIT PULL 14/19] lightnvm: fix type checks on rrpc
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

sector_t is always unsigned, therefore avoid < 0 checks on it.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/rrpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 5dba544..cf0e28a 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -817,7 +817,7 @@ static int rrpc_read_ppalist_rq(struct rrpc *rrpc, struct bio *bio,
 
 	for (i = 0; i < npages; i++) {
 		/* We assume that mapping occurs at 4KB granularity */
-		BUG_ON(!(laddr + i >= 0 && laddr + i < rrpc->nr_sects));
+		BUG_ON(!(laddr + i < rrpc->nr_sects));
 		gp = &rrpc->trans_map[laddr + i];
 
 		if (gp->rblk) {
@@ -846,7 +846,7 @@ static int rrpc_read_rq(struct rrpc *rrpc, struct bio *bio, struct nvm_rq *rqd,
 	if (!is_gc && rrpc_lock_rq(rrpc, bio, rqd))
 		return NVM_IO_REQUEUE;
 
-	BUG_ON(!(laddr >= 0 && laddr < rrpc->nr_sects));
+	BUG_ON(!(laddr < rrpc->nr_sects));
 	gp = &rrpc->trans_map[laddr];
 
 	if (gp->rblk) {
-- 
2.9.3
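
A minimal user-space sketch of why the lower-bound test is redundant,
assuming a 64-bit unsigned stand-in for sector_t. The names
sector_example_t and laddr_in_range() are hypothetical and are not
kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's sector_t, which is an unsigned integer type. */
typedef uint64_t sector_example_t;

/*
 * Returns true when laddr indexes a valid entry of a table that holds
 * nr_sects entries. No ">= 0" test is needed: an unsigned value can
 * never be negative, and a "negative" input wraps to a huge value that
 * the upper-bound test rejects.
 */
static bool laddr_in_range(sector_example_t laddr, sector_example_t nr_sects)
{
	return laddr < nr_sects;
}
```

A value such as (sector_example_t)-1 wraps to the largest representable
number, so the single upper-bound comparison already rejects it.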

* [GIT PULL 12/19] lightnvm: make nvm_free static
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

Add the missing static keyword to the nvm_free function.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index a63b563..eb9ab1a 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -999,7 +999,7 @@ static int nvm_core_init(struct nvm_dev *dev)
 	return ret;
 }
 
-void nvm_free(struct nvm_dev *dev)
+static void nvm_free(struct nvm_dev *dev)
 {
 	if (!dev)
 		return;
-- 
2.9.3

* [GIT PULL 11/19] lightnvm: allow to init targets on factory mode
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

Target initialization has two responsibilities: creating the target
partition and instantiating the target. This patch allows creating a
factory partition (i.e., one that does not trigger recovery on the given
target). This is useful for target development and for being able to
restore the device state at any moment in time without requiring a
full-device erase.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c       | 14 +++++++++++---
 drivers/lightnvm/rrpc.c       |  3 ++-
 include/linux/lightnvm.h      |  3 ++-
 include/uapi/linux/lightnvm.h |  4 ++++
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 5f84d2a..a63b563 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -280,7 +280,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	tdisk->fops = &nvm_fops;
 	tdisk->queue = tqueue;
 
-	targetdata = tt->init(tgt_dev, tdisk);
+	targetdata = tt->init(tgt_dev, tdisk, create->flags);
 	if (IS_ERR(targetdata))
 		goto err_init;
 
@@ -1244,8 +1244,16 @@ static long nvm_ioctl_dev_create(struct file *file, void __user *arg)
 	create.tgtname[DISK_NAME_LEN - 1] = '\0';
 
 	if (create.flags != 0) {
-		pr_err("nvm: no flags supported\n");
-		return -EINVAL;
+		__u32 flags = create.flags;
+
+		/* Check for valid flags */
+		if (flags & NVM_TARGET_FACTORY)
+			flags &= ~NVM_TARGET_FACTORY;
+
+		if (flags) {
+			pr_err("nvm: flag not supported\n");
+			return -EINVAL;
+		}
 	}
 
 	return __nvm_configure_create(&create);
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index a8acf9e..5dba544 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -1506,7 +1506,8 @@ static int rrpc_luns_configure(struct rrpc *rrpc)
 
 static struct nvm_tgt_type tt_rrpc;
 
-static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk)
+static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
+		       int flags)
 {
 	struct request_queue *bqueue = dev->q;
 	struct request_queue *tqueue = tdisk->queue;
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index eff7d1f..7dfa56e 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -436,7 +436,8 @@ static inline int ppa_cmp_blk(struct ppa_addr ppa1, struct ppa_addr ppa2)
 
 typedef blk_qc_t (nvm_tgt_make_rq_fn)(struct request_queue *, struct bio *);
 typedef sector_t (nvm_tgt_capacity_fn)(void *);
-typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *);
+typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *,
+				int flags);
 typedef void (nvm_tgt_exit_fn)(void *);
 typedef int (nvm_tgt_sysfs_init_fn)(struct gendisk *);
 typedef void (nvm_tgt_sysfs_exit_fn)(struct gendisk *);
diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h
index fd19f36..c8aec4b 100644
--- a/include/uapi/linux/lightnvm.h
+++ b/include/uapi/linux/lightnvm.h
@@ -85,6 +85,10 @@ struct nvm_ioctl_create_conf {
 	};
 };
 
+enum {
+	NVM_TARGET_FACTORY = 1 << 0,	/* Init target in factory mode */
+};
+
 struct nvm_ioctl_create {
 	char dev[DISK_NAME_LEN];		/* open-channel SSD device */
 	char tgttype[NVM_TTYPE_NAME_MAX];	/* target type name */
-- 
2.9.3
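
The flag check in nvm_ioctl_dev_create() strips each supported bit and
rejects whatever remains. An equivalent formulation, sketched here in
user space, masks all known bits at once. The EXAMPLE_* names and
example_check_create_flags() are hypothetical and only mirror the flag
value shown in the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the value of NVM_TARGET_FACTORY from the patch. */
enum {
	EXAMPLE_TARGET_FACTORY = 1 << 0,	/* init target in factory mode */
};

/* Every flag bit this sketch knows about. */
#define EXAMPLE_TARGET_KNOWN_FLAGS ((uint32_t)EXAMPLE_TARGET_FACTORY)

/* Returns 0 if only known flags are set, -EINVAL otherwise. */
static int example_check_create_flags(uint32_t flags)
{
	if (flags & ~EXAMPLE_TARGET_KNOWN_FLAGS)
		return -EINVAL;

	return 0;
}
```

With this shape, supporting another flag only requires extending the
mask, at the cost of keeping the mask in sync with the enum.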

* [GIT PULL 10/19] lightnvm: bad type conversion for nvme control bits
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

The NVMe I/O command control field is 16 bits wide, but it is
interpreted as 32 bits in the lightnvm user I/O data path.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/nvme/host/lightnvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 12c5a40..4b78090 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -754,7 +754,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
 	c.ph_rw.length = cpu_to_le16(vcmd.nppas);
-	c.ph_rw.control  = cpu_to_le32(vcmd.control);
+	c.ph_rw.control  = cpu_to_le16(vcmd.control);
 	c.common.cdw10[3] = cpu_to_le32(vcmd.cdw13);
 	c.common.cdw10[4] = cpu_to_le32(vcmd.cdw14);
 	c.common.cdw10[5] = cpu_to_le32(vcmd.cdw15);
-- 
2.9.3
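
A user-space sketch of what goes wrong when the conversion width does
not match the field width, assuming a big-endian host (on little-endian
hosts both conversions are no-ops and the value survives truncation).
The example_* helpers are hypothetical; they only model the byte swaps
that cpu_to_le16() and cpu_to_le32() perform on big-endian CPUs.

```c
#include <stdint.h>

/* Models cpu_to_le16() on a big-endian CPU: swap the two bytes. */
static uint16_t example_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Models cpu_to_le32() on a big-endian CPU: reverse the four bytes. */
static uint32_t example_swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Buggy variant: 32-bit conversion stored into a 16-bit field. */
static uint16_t example_control_wrong_be(uint16_t control)
{
	/* The swapped payload lands in the upper half and is truncated away. */
	return (uint16_t)example_swab32(control);
}

/* Fixed variant: the conversion width matches the field width. */
static uint16_t example_control_right_be(uint16_t control)
{
	return example_swab16(control);
}
```

For a control value of 0x0200, the 32-bit swap moves the payload into
the upper 16 bits, so the 16-bit field ends up as zero, while the 16-bit
swap yields the expected 0x0002.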

* [GIT PULL 09/19] lightnvm: fix cleanup order of disk on init error
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

Reorder disk allocation such that the disk structure can be put
safely.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 5eea3d5..5f84d2a 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -264,15 +264,15 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 		goto err_t;
 	}
 
+	tdisk = alloc_disk(0);
+	if (!tdisk)
+		goto err_dev;
+
 	tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
 	if (!tqueue)
-		goto err_dev;
+		goto err_disk;
 	blk_queue_make_request(tqueue, tt->make_rq);
 
-	tdisk = alloc_disk(0);
-	if (!tdisk)
-		goto err_queue;
-
 	sprintf(tdisk->disk_name, "%s", create->tgtname);
 	tdisk->flags = GENHD_FL_EXT_DEVT;
 	tdisk->major = 0;
@@ -308,9 +308,9 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	if (tt->exit)
 		tt->exit(targetdata);
 err_init:
-	put_disk(tdisk);
-err_queue:
 	blk_cleanup_queue(tqueue);
+err_disk:
+	put_disk(tdisk);
 err_dev:
 	nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
-- 
2.9.3
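
The reordering follows the usual rule that error labels release
resources in the reverse order of acquisition. A user-space sketch of
that rule, with malloc()/free() standing in for the disk and queue
helpers. All example_* names and the fail_at parameter are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical resources standing in for the gendisk and request queue. */
struct example_disk { int unused; };
struct example_queue { int unused; };

/* Records which cleanup steps ran, in order (illustration only). */
static char example_cleanup_log[8];

static void example_log(char step)
{
	size_t len = strlen(example_cleanup_log);

	if (len < sizeof(example_cleanup_log) - 1)
		example_cleanup_log[len] = step;
}

/*
 * Acquire the disk, then the queue, then run init. fail_at selects the
 * step that fails (0 = none, 1 = disk, 2 = queue, 3 = init).
 */
static int example_create(int fail_at)
{
	struct example_disk *disk = NULL;
	struct example_queue *queue = NULL;
	int ret = -1;

	memset(example_cleanup_log, 0, sizeof(example_cleanup_log));

	disk = (fail_at == 1) ? NULL : malloc(sizeof(*disk));
	if (!disk)
		goto err_dev;

	queue = (fail_at == 2) ? NULL : malloc(sizeof(*queue));
	if (!queue)
		goto err_disk;

	if (fail_at == 3)
		goto err_init;

	ret = 0;	/* success; this sketch still releases everything below */
err_init:
	free(queue);		/* acquired last, released first */
	example_log('q');
err_disk:
	free(disk);
	example_log('d');
err_dev:
	example_log('t');	/* stands in for removing the target device */
	return ret;
}
```

Unlike the kernel function, the sketch falls through the labels on
success as well, purely to stay leak-free. A failed queue allocation
leaves "dt" in the log: the disk is released and the queue cleanup is
never reached.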

* [GIT PULL 08/19] lightnvm: double-clear of dev->lun_map on target init error
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

The dev->lun_map bits are cleared twice if a target init error occurs:
first in the target clean routine, and then again in the nvm_create_tgt
error path. Make sure that they are only cleared once by extending
nvm_remove_tgt_dev() with a clear argument, such that clearing of the
bits is skipped on the init error path and is only done when removing a
successfully initialized target.

Signed-off-by: Javier González <javier@cnexlabs.com>
Fix style.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index a14c52c..5eea3d5 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -89,7 +89,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int lun_begin,
 		WARN_ON(!test_and_clear_bit(i, dev->lun_map));
 }
 
-static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
+static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
 	struct nvm_dev *dev = tgt_dev->parent;
 	struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -100,11 +100,14 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
 		int *lun_offs = ch_map->lun_offs;
 		int ch = i + ch_map->ch_off;
 
-		for (j = 0; j < ch_map->nr_luns; j++) {
-			int lun = j + lun_offs[j];
-			int lunid = (ch * dev->geo.luns_per_chnl) + lun;
+		if (clear) {
+			for (j = 0; j < ch_map->nr_luns; j++) {
+				int lun = j + lun_offs[j];
+				int lunid = (ch * dev->geo.luns_per_chnl) + lun;
 
-			WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+				WARN_ON(!test_and_clear_bit(lunid,
+							dev->lun_map));
+			}
 		}
 
 		kfree(ch_map->lun_offs);
@@ -309,7 +312,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 err_queue:
 	blk_cleanup_queue(tqueue);
 err_dev:
-	nvm_remove_tgt_dev(tgt_dev);
+	nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
 	kfree(t);
 err_reserve:
@@ -332,7 +335,7 @@ static void __nvm_remove_target(struct nvm_target *t)
 	if (tt->exit)
 		tt->exit(tdisk->private_data);
 
-	nvm_remove_tgt_dev(t->dev);
+	nvm_remove_tgt_dev(t->dev, 1);
 	put_disk(tdisk);
 
 	list_del(&t->list);
-- 
2.9.3
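
A user-space sketch of how a double clear shows up. It assumes a
non-atomic stand-in for test_and_clear_bit() on a single 64-bit word;
the example_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal, non-atomic stand-in for test_and_clear_bit(); nr must be < 64. */
static bool example_test_and_clear_bit(unsigned int nr, uint64_t *map)
{
	uint64_t mask = UINT64_C(1) << nr;
	bool was_set = (*map & mask) != 0;

	*map &= ~mask;
	return was_set;
}

/*
 * Release LUNs begin..end (inclusive). Returns how many bits were
 * already clear, i.e. how often WARN_ON(!test_and_clear_bit(...))
 * would have fired in the kernel code.
 */
static int example_release_luns(uint64_t *map, unsigned int begin,
				unsigned int end)
{
	unsigned int i;
	int warnings = 0;

	for (i = begin; i <= end; i++)
		if (!example_test_and_clear_bit(i, map))
			warnings++;

	return warnings;
}
```

Releasing the same range a second time reports one warning per LUN,
which is the symptom the clear argument avoids on the init error path.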

* [GIT PULL 07/19] lightnvm: don't check for failure from mempool_alloc()
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe; +Cc: linux-block, linux-kernel, NeilBrown, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: NeilBrown <neilb@suse.com>

mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and both GFP_KERNEL and GFP_NOIO allow sleeping.

So rrpc_move_valid_pages() and rrpc_make_rq() don't need to
test the return value.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/rrpc.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 4e4c299..a8acf9e 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -318,10 +318,6 @@ static int rrpc_move_valid_pages(struct rrpc *rrpc, struct rrpc_block *rblk)
 	}
 
 	page = mempool_alloc(rrpc->page_pool, GFP_NOIO);
-	if (!page) {
-		bio_put(bio);
-		return -ENOMEM;
-	}
 
 	while ((slot = find_first_zero_bit(rblk->invalid_pages,
 					    nr_sec_per_blk)) < nr_sec_per_blk) {
@@ -1006,11 +1002,6 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	}
 
 	rqd = mempool_alloc(rrpc->rq_pool, GFP_KERNEL);
-	if (!rqd) {
-		pr_err_ratelimited("rrpc: not able to queue bio.");
-		bio_io_error(bio);
-		return BLK_QC_T_NONE;
-	}
 	memset(rqd, 0, sizeof(struct nvm_rq));
 
 	err = rrpc_submit_io(rrpc, bio, rqd, NVM_IOTYPE_NONE);
-- 
2.9.3

* [GIT PULL 06/19] lightnvm: enable nvme size compile asserts
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe; +Cc: linux-block, linux-kernel, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

The asserts in _nvme_nvm_check_size are not compiled because the function
is never called. Make sure that it is called, and also fix the wrong
sizes in the asserts for nvme_nvm_addr_format and nvme_nvm_bb_tbl, which
checked for the number of bits instead of bytes.

Reported-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/nvme/host/lightnvm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 4ea9c93..12c5a40 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -241,9 +241,9 @@ static inline void _nvme_nvm_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_l2ptbl) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_id_group) != 960);
-	BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 128);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 16);
 	BUILD_BUG_ON(sizeof(struct nvme_nvm_id) != 4096);
-	BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
 }
 
 static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
@@ -797,6 +797,8 @@ int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 	struct request_queue *q = ns->queue;
 	struct nvm_dev *dev;
 
+	_nvme_nvm_check_size();
+
 	dev = nvm_alloc_dev(node);
 	if (!dev)
 		return -ENOMEM;
-- 
2.9.3
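
sizeof() yields a size in bytes, which is why the corrected asserts
compare against 16 and 64 rather than 128 and 512. A user-space sketch
using C11 static_assert in place of BUILD_BUG_ON. The struct layout is
hypothetical and only mirrors the 16-byte size quoted in this series.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical 16-byte layout: twelve one-byte fields followed by four
 * reserved bytes.
 */
struct example_addr_format {
	uint8_t fields[12];
	uint8_t reserved[4];
};

/* sizeof() counts bytes, so the size check must use 16, not 128. */
static_assert(sizeof(struct example_addr_format) == 16,
	      "example_addr_format must be 16 bytes");

/* The same size expressed in bits, for comparison. */
static_assert(sizeof(struct example_addr_format) * CHAR_BIT == 128,
	      "16 bytes are 128 bits when CHAR_BIT is 8");
```

Both assertions are evaluated at compile time, so a layout change that
alters the size stops the build instead of failing at run time.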

* [GIT PULL 05/19] lightnvm: free reverse device map
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

Free the reverse mapping table correctly on teardown.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 95105c4..a14c52c 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -411,6 +411,18 @@ static int nvm_register_map(struct nvm_dev *dev)
 	return -ENOMEM;
 }
 
+static void nvm_unregister_map(struct nvm_dev *dev)
+{
+	struct nvm_dev_map *rmap = dev->rmap;
+	int i;
+
+	for (i = 0; i < dev->geo.nr_chnls; i++)
+		kfree(rmap->chnls[i].lun_offs);
+
+	kfree(rmap->chnls);
+	kfree(rmap);
+}
+
 static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
 	struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -992,7 +1004,7 @@ void nvm_free(struct nvm_dev *dev)
 	if (dev->dma_pool)
 		dev->ops->destroy_dma_pool(dev->dma_pool);
 
-	kfree(dev->rmap);
+	nvm_unregister_map(dev);
 	kfree(dev->lptbl);
 	kfree(dev->lun_map);
 	kfree(dev);
-- 
2.9.3
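
A user-space sketch of the nested teardown: the per-channel arrays are
freed first, then the channel array, then the map itself. The example_*
names and the allocation counter are hypothetical and exist only to show
that nothing is left behind.

```c
#include <stdlib.h>

struct example_ch_map {
	int *lun_offs;
};

struct example_dev_map {
	struct example_ch_map *chnls;
	int nr_chnls;
};

/* Number of live allocations made through the helpers below. */
static int example_live_allocs;

static void *example_alloc(size_t size)
{
	void *p = calloc(1, size);	/* zeroed, so unset pointers are NULL */

	if (p)
		example_live_allocs++;
	return p;
}

static void example_free(void *p)
{
	if (p)
		example_live_allocs--;
	free(p);
}

/* Free inner arrays before the structures that point to them. */
static void example_unregister_map(struct example_dev_map *rmap)
{
	int i;

	if (!rmap)
		return;

	if (rmap->chnls)
		for (i = 0; i < rmap->nr_chnls; i++)
			example_free(rmap->chnls[i].lun_offs);

	example_free(rmap->chnls);
	example_free(rmap);
}

/* Expects nr_chnls > 0 and nr_luns > 0. */
static struct example_dev_map *example_register_map(int nr_chnls, int nr_luns)
{
	struct example_dev_map *rmap;
	int i;

	rmap = example_alloc(sizeof(*rmap));
	if (!rmap)
		return NULL;

	rmap->nr_chnls = nr_chnls;
	rmap->chnls = example_alloc((size_t)nr_chnls * sizeof(*rmap->chnls));
	if (!rmap->chnls)
		goto err;

	for (i = 0; i < nr_chnls; i++) {
		rmap->chnls[i].lun_offs =
			example_alloc((size_t)nr_luns * sizeof(int));
		if (!rmap->chnls[i].lun_offs)
			goto err;
	}

	return rmap;
err:
	example_unregister_map(rmap);
	return NULL;
}
```

A map with four channels makes six allocations (the map, the channel
array, and four offset arrays). Freeing only the top-level pointer, as
the code did before this patch, would leave the other five behind.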

* [GIT PULL 04/19] lightnvm: rename scrambler controller hint
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

According to the OCSSD 1.2 specification, the 0x200 hint enables the
media scrambler for the read/write opcode, provided that the controller
has been correctly configured by the firmware. Rename the macro to
reflect this meaning.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 include/linux/lightnvm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e11163f..eff7d1f 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -123,7 +123,7 @@ enum {
 	/* NAND Access Modes */
 	NVM_IO_SUSPEND		= 0x80,
 	NVM_IO_SLC_MODE		= 0x100,
-	NVM_IO_SCRAMBLE_DISABLE	= 0x200,
+	NVM_IO_SCRAMBLE_ENABLE	= 0x200,
 
 	/* Block Types */
 	NVM_BLK_T_FREE		= 0x0,
-- 
2.9.3

* [GIT PULL 03/19] lightnvm: submit erases using the I/O path
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe
  Cc: linux-block, linux-kernel, Javier González,
	Javier González, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Javier González <jg@lightnvm.io>

Until now erases have been submitted as synchronous commands through a
dedicated erase function. In order to enable targets implementing
asynchronous erases, refactor the erase path so that it uses the normal
async I/O submission functions. If a target requires sync I/O, it can
implement it internally. Also, adapt rrpc to use the new erase path.

Signed-off-by: Javier González <javier@cnexlabs.com>
Fixed spelling error.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c      | 54 +++++++++++++++++++++++++++-----------------
 drivers/lightnvm/rrpc.c      |  3 +--
 drivers/nvme/host/lightnvm.c | 32 ++++++++------------------
 include/linux/lightnvm.h     |  8 +++----
 4 files changed, 47 insertions(+), 50 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 5262ba6..95105c4 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -590,11 +590,11 @@ int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
 
 	memset(&rqd, 0, sizeof(struct nvm_rq));
 
-	nvm_set_rqd_ppalist(dev, &rqd, ppas, nr_ppas, 1);
+	nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
 	nvm_rq_tgt_to_dev(tgt_dev, &rqd);
 
 	ret = dev->ops->set_bb_tbl(dev, &rqd.ppa_addr, rqd.nr_ppas, type);
-	nvm_free_rqd_ppalist(dev, &rqd);
+	nvm_free_rqd_ppalist(tgt_dev, &rqd);
 	if (ret) {
 		pr_err("nvm: failed bb mark\n");
 		return -EINVAL;
@@ -626,34 +626,45 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);
 
-int nvm_erase_blk(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas, int flags)
+static void nvm_end_io_sync(struct nvm_rq *rqd)
 {
-	struct nvm_dev *dev = tgt_dev->parent;
+	struct completion *waiting = rqd->private;
+
+	complete(waiting);
+}
+
+int nvm_erase_sync(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
+								int nr_ppas)
+{
+	struct nvm_geo *geo = &tgt_dev->geo;
 	struct nvm_rq rqd;
 	int ret;
-
-	if (!dev->ops->erase_block)
-		return 0;
-
-	nvm_map_to_dev(tgt_dev, ppas);
+	DECLARE_COMPLETION_ONSTACK(wait);
 
 	memset(&rqd, 0, sizeof(struct nvm_rq));
 
-	ret = nvm_set_rqd_ppalist(dev, &rqd, ppas, 1, 1);
+	rqd.opcode = NVM_OP_ERASE;
+	rqd.end_io = nvm_end_io_sync;
+	rqd.private = &wait;
+	rqd.flags = geo->plane_mode >> 1;
+
+	ret = nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
 	if (ret)
 		return ret;
 
-	nvm_rq_tgt_to_dev(tgt_dev, &rqd);
+	ret = nvm_submit_io(tgt_dev, &rqd);
+	if (ret) {
+		pr_err("rrpr: erase I/O submission failed: %d\n", ret);
+		goto free_ppa_list;
+	}
+	wait_for_completion_io(&wait);
 
-	rqd.flags = flags;
-
-	ret = dev->ops->erase_block(dev, &rqd);
-
-	nvm_free_rqd_ppalist(dev, &rqd);
+free_ppa_list:
+	nvm_free_rqd_ppalist(tgt_dev, &rqd);
 
 	return ret;
 }
-EXPORT_SYMBOL(nvm_erase_blk);
+EXPORT_SYMBOL(nvm_erase_sync);
 
 int nvm_get_l2p_tbl(struct nvm_tgt_dev *tgt_dev, u64 slba, u32 nlb,
 		    nvm_l2p_update_fn *update_l2p, void *priv)
@@ -732,10 +743,11 @@ void nvm_put_area(struct nvm_tgt_dev *tgt_dev, sector_t begin)
 }
 EXPORT_SYMBOL(nvm_put_area);
 
-int nvm_set_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd,
+int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
 			const struct ppa_addr *ppas, int nr_ppas, int vblk)
 {
-	struct nvm_geo *geo = &dev->geo;
+	struct nvm_dev *dev = tgt_dev->parent;
+	struct nvm_geo *geo = &tgt_dev->geo;
 	int i, plane_cnt, pl_idx;
 	struct ppa_addr ppa;
 
@@ -773,12 +785,12 @@ int nvm_set_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd,
 }
 EXPORT_SYMBOL(nvm_set_rqd_ppalist);
 
-void nvm_free_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd)
+void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 {
 	if (!rqd->ppa_list)
 		return;
 
-	nvm_dev_dma_free(dev, rqd->ppa_list, rqd->dma_ppa_list);
+	nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
 }
 EXPORT_SYMBOL(nvm_free_rqd_ppalist);
 
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e68efbc..4e4c299 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -414,7 +414,6 @@ static void rrpc_block_gc(struct work_struct *work)
 	struct rrpc *rrpc = gcb->rrpc;
 	struct rrpc_block *rblk = gcb->rblk;
 	struct rrpc_lun *rlun = rblk->rlun;
-	struct nvm_tgt_dev *dev = rrpc->dev;
 	struct ppa_addr ppa;
 
 	mempool_free(gcb, rrpc->gcb_pool);
@@ -430,7 +429,7 @@ static void rrpc_block_gc(struct work_struct *work)
 	ppa.g.lun = rlun->bppa.g.lun;
 	ppa.g.blk = rblk->id;
 
-	if (nvm_erase_blk(dev, &ppa, 0))
+	if (nvm_erase_sync(rrpc->dev, &ppa, 1))
 		goto put_back;
 
 	rrpc_put_blk(rrpc, rblk);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index fd98954..4ea9c93 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -510,12 +510,16 @@ static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 	}
 	rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
 
-	rq->ioprio = bio_prio(bio);
-	if (bio_has_data(bio))
-		rq->nr_phys_segments = bio_phys_segments(q, bio);
-
-	rq->__data_len = bio->bi_iter.bi_size;
-	rq->bio = rq->biotail = bio;
+	if (bio) {
+		rq->ioprio = bio_prio(bio);
+		rq->__data_len = bio->bi_iter.bi_size;
+		rq->bio = rq->biotail = bio;
+		if (bio_has_data(bio))
+			rq->nr_phys_segments = bio_phys_segments(q, bio);
+	} else {
+		rq->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
+		rq->__data_len = 0;
+	}
 
 	nvme_nvm_rqtocmd(rq, rqd, ns, cmd);
 
@@ -526,21 +530,6 @@ static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 	return 0;
 }
 
-static int nvme_nvm_erase_block(struct nvm_dev *dev, struct nvm_rq *rqd)
-{
-	struct request_queue *q = dev->q;
-	struct nvme_ns *ns = q->queuedata;
-	struct nvme_nvm_command c = {};
-
-	c.erase.opcode = NVM_OP_ERASE;
-	c.erase.nsid = cpu_to_le32(ns->ns_id);
-	c.erase.spba = cpu_to_le64(rqd->ppa_addr.ppa);
-	c.erase.length = cpu_to_le16(rqd->nr_ppas - 1);
-	c.erase.control = cpu_to_le16(rqd->flags);
-
-	return nvme_submit_sync_cmd(q, (struct nvme_command *)&c, NULL, 0);
-}
-
 static void *nvme_nvm_create_dma_pool(struct nvm_dev *nvmdev, char *name)
 {
 	struct nvme_ns *ns = nvmdev->q->queuedata;
@@ -576,7 +565,6 @@ static struct nvm_dev_ops nvme_nvm_dev_ops = {
 	.set_bb_tbl		= nvme_nvm_set_bb_tbl,
 
 	.submit_io		= nvme_nvm_submit_io,
-	.erase_block		= nvme_nvm_erase_block,
 
 	.create_dma_pool	= nvme_nvm_create_dma_pool,
 	.destroy_dma_pool	= nvme_nvm_destroy_dma_pool,
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index ca45e4a..e11163f 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -56,7 +56,6 @@ typedef int (nvm_get_l2p_tbl_fn)(struct nvm_dev *, u64, u32,
 typedef int (nvm_op_bb_tbl_fn)(struct nvm_dev *, struct ppa_addr, u8 *);
 typedef int (nvm_op_set_bb_fn)(struct nvm_dev *, struct ppa_addr *, int, int);
 typedef int (nvm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *);
-typedef int (nvm_erase_blk_fn)(struct nvm_dev *, struct nvm_rq *);
 typedef void *(nvm_create_dma_pool_fn)(struct nvm_dev *, char *);
 typedef void (nvm_destroy_dma_pool_fn)(void *);
 typedef void *(nvm_dev_dma_alloc_fn)(struct nvm_dev *, void *, gfp_t,
@@ -70,7 +69,6 @@ struct nvm_dev_ops {
 	nvm_op_set_bb_fn	*set_bb_tbl;
 
 	nvm_submit_io_fn	*submit_io;
-	nvm_erase_blk_fn	*erase_block;
 
 	nvm_create_dma_pool_fn	*create_dma_pool;
 	nvm_destroy_dma_pool_fn	*destroy_dma_pool;
@@ -479,10 +477,10 @@ extern int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *, struct ppa_addr *,
 			      int, int);
 extern int nvm_max_phys_sects(struct nvm_tgt_dev *);
 extern int nvm_submit_io(struct nvm_tgt_dev *, struct nvm_rq *);
-extern int nvm_set_rqd_ppalist(struct nvm_dev *, struct nvm_rq *,
+extern int nvm_erase_sync(struct nvm_tgt_dev *, struct ppa_addr *, int);
+extern int nvm_set_rqd_ppalist(struct nvm_tgt_dev *, struct nvm_rq *,
 					const struct ppa_addr *, int, int);
-extern void nvm_free_rqd_ppalist(struct nvm_dev *, struct nvm_rq *);
-extern int nvm_erase_blk(struct nvm_tgt_dev *, struct ppa_addr *, int);
+extern void nvm_free_rqd_ppalist(struct nvm_tgt_dev *, struct nvm_rq *);
 extern int nvm_get_l2p_tbl(struct nvm_tgt_dev *, u64, u32, nvm_l2p_update_fn *,
 			   void *);
 extern int nvm_get_area(struct nvm_tgt_dev *, sector_t *, sector_t);
-- 
2.9.3
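
The new nvm_erase_sync() builds a synchronous call on top of the
asynchronous submission path: the request carries a completion in its
private pointer, and the end_io callback signals it. A single-threaded
user-space sketch of that pattern, in which a one-slot queue and a
polling function simulate the device. Every example_* name is
hypothetical, and returning the request's error field is a choice of
this sketch rather than something the patch does.

```c
#include <stdbool.h>
#include <stddef.h>

enum { EXAMPLE_OP_ERASE = 1 };

struct example_rq;
typedef void (example_end_io_fn)(struct example_rq *);

struct example_rq {
	int opcode;
	int error;
	example_end_io_fn *end_io;	/* called when the request finishes */
	void *private;			/* owner-defined context */
};

struct example_completion {
	bool done;
};

/* One-slot "device queue" holding the request in flight, if any. */
static struct example_rq *example_inflight;

/* Asynchronous submission: queue the request and return immediately. */
static int example_submit_io(struct example_rq *rq)
{
	if (example_inflight)
		return -1;	/* queue is busy */

	example_inflight = rq;
	return 0;
}

/* Simulates the device finishing the queued request. */
static void example_device_poll(void)
{
	struct example_rq *rq = example_inflight;

	if (!rq)
		return;

	example_inflight = NULL;
	rq->error = 0;
	if (rq->end_io)
		rq->end_io(rq);
}

/* Completion callback: wake up whoever waits on the completion. */
static void example_end_io_sync(struct example_rq *rq)
{
	struct example_completion *waiting = rq->private;

	waiting->done = true;
}

/* Synchronous erase built from the asynchronous pieces above. */
static int example_erase_sync(void)
{
	struct example_completion wait = { .done = false };
	struct example_rq rq = {
		.opcode = EXAMPLE_OP_ERASE,
		.end_io = example_end_io_sync,
		.private = &wait,
	};
	int ret;

	ret = example_submit_io(&rq);
	if (ret)
		return ret;

	/* Stand-in for wait_for_completion_io(): poll until signalled. */
	while (!wait.done)
		example_device_poll();

	return rq.error;
}
```

A target that wants asynchronous erases would install its own end_io
callback instead of waiting, which is the flexibility the refactoring
is after.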

* [GIT PULL 02/19] nvme/lightnvm: Prevent small buffer overflow in nvme_nvm_identify
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe; +Cc: linux-block, linux-kernel, Scott Bauer, Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Scott Bauer <scott.bauer@intel.com>

There are two closely named structs in lightnvm:
struct nvme_nvm_addr_format and
struct nvm_addr_format.

The first struct has 4 reserved bytes at the end, the second does not.
(gdb) p sizeof(struct nvme_nvm_addr_format)
$1 = 16
(gdb) p sizeof(struct nvm_addr_format)
$2 = 12

In the nvme_nvm_identity function we memcpy from the larger struct to the
smaller struct. We incorrectly pass the length of the larger struct
and overflow by 4 bytes; let's not do that.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/nvme/host/lightnvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 21cac85..fd98954 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -324,7 +324,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 	nvm_id->cap = le32_to_cpu(nvme_nvm_id->cap);
 	nvm_id->dom = le32_to_cpu(nvme_nvm_id->dom);
 	memcpy(&nvm_id->ppaf, &nvme_nvm_id->ppaf,
-					sizeof(struct nvme_nvm_addr_format));
+					sizeof(struct nvm_addr_format));
 
 	ret = init_grps(nvm_id, nvme_nvm_id);
 out:
-- 
2.9.3
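
A user-space sketch of the fix: size the copy by the destination so
that the extra bytes of the larger source are never written. The struct
layouts are hypothetical and only reproduce the 16-byte and 12-byte
sizes quoted above. The canary field stands in for whatever follows the
destination in memory.

```c
#include <stdint.h>
#include <string.h>

/* 16 bytes: twelve fields plus four reserved bytes (device layout). */
struct example_wire_format {
	uint8_t fields[12];
	uint8_t reserved[4];
};

/* 12 bytes: the same fields without the reserved tail (host layout). */
struct example_host_format {
	uint8_t fields[12];
};

struct example_id {
	struct example_host_format ppaf;
	uint8_t canary[4];	/* must stay untouched by the copy */
};

static void example_copy_format(struct example_id *id,
				const struct example_wire_format *wire)
{
	/* Size the copy by the destination, never by the larger source. */
	memcpy(&id->ppaf, wire, sizeof(id->ppaf));
}
```

Using sizeof on the destination member, rather than naming either
struct type, keeps the length correct even if the two types are
confused again later.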

* [GIT PULL 01/19] lightnvm: Fix error handling
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe; +Cc: linux-block, linux-kernel, Christophe JAILLET,
	Matias Bjørling
In-Reply-To: <20170415185553.16098-1-matias@cnexlabs.com>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

According to error handling in this function, it is likely that going to
'out' was expected here.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/rrpc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7..e68efbc 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -1275,8 +1275,10 @@ static int rrpc_bb_discovery(struct nvm_tgt_dev *dev, struct rrpc_lun *rlun)
 	}
 
 	nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks);
-	if (nr_blks < 0)
-		return nr_blks;
+	if (nr_blks < 0) {
+		ret = nr_blks;
+		goto out;
+	}
 
 	for (i = 0; i < nr_blks; i++) {
 		if (blks[i] == NVM_BLK_T_FREE)
-- 
2.9.3

* [GIT PULL 00/19] LightNVM patches for 4.12.
From: Matias Bjørling @ 2017-04-15 18:55 UTC
  To: axboe; +Cc: linux-block, linux-kernel, Matias Bjørling

Hi Jens,

With this merge window, we would like to push pblk upstream. It is a new
host-side translation layer that implements support for exposing
Open-Channel SSDs as block devices.

We have described pblk in the LightNVM paper "LightNVM: The Linux
Open-Channel SSD Subsystem" that was accepted at FAST 2017. The paper
defines open-channel SSDs, the subsystem, pblk and has an evaluation as
well. Over the past couple of kernel versions we have shipped the
support patches for pblk, and we are now comfortable pushing the core of
pblk upstream.

The core contains the logic to control data placement and I/O scheduling
on open-channel SSDs, including the implementation of translation table
management, GC, recovery, rate-limiting, and similar components. It
assumes that the SSD is media-agnostic, and runs on both 1.2 and 2.0 of
the Open-Channel SSD specification without modifications.

I want to point out two neat features of pblk. First, pblk can be
instantiated multiple times on the same SSD, enabling I/O isolation
between tenants and making it possible to fulfill strict QoS
requirements. We showed results from this at the NVMW '17 workshop this
year, while presenting the "Multi-Tenant I/O Isolation with Open-Channel
SSDs" talk.
Second, now that a full host-side translation layer is implemented, one
can begin to optimize its data placement and I/O scheduling algorithms
to match user workloads. We have shown a couple of the benefits in the
LightNVM paper, and we know of a couple of companies and universities
that have begun making new algorithms.

In detail, this pull request contains:

 - The new host-side FTL pblk from Javier, and other contributors.

 - Add support to the "create" ioctl to force a target to be
   re-initialized using the "factory" flag, from Javier.

 - Fix various errors in LightNVM core from Javier and me.

 - An optimization from Neil Brown to skip error checking on mempool
   allocations that can sleep.

 - A buffer overflow fix in nvme_nvm_identify from Scott Bauer.

 - Fix for error handling in bad block discovery from Christophe
   Jaillet.

 - Fixes from Dan Carpenter to pblk after it went into linux-next.

Please pull from the for-jens branch or apply the patches posted with
this mail:

   https://github.com/OpenChannelSSD/linux.git for-jens

Thanks,
Matias

Christophe JAILLET (1):
  lightnvm: Fix error handling

Dan Carpenter (3):
  lightnvm: pblk-gc: fix an error pointer dereference in init
  lightnvm: fix some WARN() messages
  lightnvm: fix some error code in pblk-init.c

Javier González (12):
  lightnvm: submit erases using the I/O path
  lightnvm: rename scrambler controller hint
  lightnvm: free reverse device map
  lightnvm: double-clear of dev->lun_map on target init error
  lightnvm: fix cleanup order of disk on init error
  lightnvm: bad type conversion for nvme control bits
  lightnvm: allow to init targets on factory mode
  lightnvm: make nvm_free static
  lightnvm: clean unused variable
  lightnvm: fix type checks on rrpc
  lightnvm: convert sprintf into strlcpy
  lightnvm: physical block device (pblk) target

Matias Bjørling (1):
  lightnvm: enable nvme size compile asserts

NeilBrown (1):
  lightnvm: don't check for failure from mempool_alloc()

Scott Bauer (1):
  nvme/lightnvm: Prevent small buffer overflow in nvme_nvm_identify

 Documentation/lightnvm/pblk.txt  |   21 +
 drivers/lightnvm/Kconfig         |    9 +
 drivers/lightnvm/Makefile        |    5 +
 drivers/lightnvm/core.c          |  124 +--
 drivers/lightnvm/pblk-cache.c    |  114 +++
 drivers/lightnvm/pblk-core.c     | 1655 ++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/pblk-gc.c       |  555 +++++++++++++
 drivers/lightnvm/pblk-init.c     |  957 ++++++++++++++++++++++
 drivers/lightnvm/pblk-map.c      |  136 ++++
 drivers/lightnvm/pblk-rb.c       |  852 ++++++++++++++++++++
 drivers/lightnvm/pblk-read.c     |  529 ++++++++++++
 drivers/lightnvm/pblk-recovery.c |  998 +++++++++++++++++++++++
 drivers/lightnvm/pblk-rl.c       |  182 +++++
 drivers/lightnvm/pblk-sysfs.c    |  507 ++++++++++++
 drivers/lightnvm/pblk-write.c    |  411 ++++++++++
 drivers/lightnvm/pblk.h          | 1121 ++++++++++++++++++++++++++
 drivers/lightnvm/rrpc.c          |   25 +-
 drivers/nvme/host/lightnvm.c     |   42 +-
 include/linux/lightnvm.h         |   13 +-
 include/uapi/linux/lightnvm.h    |    4 +
 20 files changed, 8165 insertions(+), 95 deletions(-)
 create mode 100644 Documentation/lightnvm/pblk.txt
 create mode 100644 drivers/lightnvm/pblk-cache.c
 create mode 100644 drivers/lightnvm/pblk-core.c
 create mode 100644 drivers/lightnvm/pblk-gc.c
 create mode 100644 drivers/lightnvm/pblk-init.c
 create mode 100644 drivers/lightnvm/pblk-map.c
 create mode 100644 drivers/lightnvm/pblk-rb.c
 create mode 100644 drivers/lightnvm/pblk-read.c
 create mode 100644 drivers/lightnvm/pblk-recovery.c
 create mode 100644 drivers/lightnvm/pblk-rl.c
 create mode 100644 drivers/lightnvm/pblk-sysfs.c
 create mode 100644 drivers/lightnvm/pblk-write.c
 create mode 100644 drivers/lightnvm/pblk.h

-- 
2.9.3

^ permalink raw reply

* Re: Outstanding MQ questions from MMC
From: Linus Walleij @ 2017-04-15 18:34 UTC (permalink / raw)
  To: Avri Altman
  Cc: Arnd Bergmann, Ulf Hansson, linux-mmc@vger.kernel.org,
	linux-block@vger.kernel.org, Jens Axboe, Christoph Hellwig,
	Adrian Hunter, Paolo Valente
In-Reply-To: <BY2PR0401MB0901B37DB60608662AE5C631E5050@BY2PR0401MB0901.namprd04.prod.outlook.com>

On Fri, Apr 14, 2017 at 8:41 PM, Avri Altman <Avri.Altman@sandisk.com> wrote:
> [Me]
>> 2. Turn RPMB and other ioctl() MMC operations into mmc_queue_req
>>    things and funnel them into the block scheduler
>>    using REQ_OP_DRV_IN/OUT requests.
>>
>
> Accessing the RPMB is done via a strange protocol, in which each access is comprised of several requests.
> For example, writing to the RPMB will require sending 5 different requests:
> 2 requests to read the write counter, and then 3 more requests for the write operation itself.
>
> Once the sequence has started, it must not be interrupted by other requests, or the operation will fail.

So I guess currently something takes a host lock and then performs the
5 requests.

Thus we need to send a single custom request containing a list of 5
things to do, and return after that.

Or do you mean that we return to userspace in between these different
requests and the sequencing is done in userspace?

I hope not because that sounds fragile, like userspace could crash and
leave the host lock dangling :/
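
To make the "single custom request containing a list of things to do"
idea concrete, here is a minimal userspace sketch. Everything in it
(struct rpmb_batch, struct host_model, rpmb_batch_execute(), count_step())
is a hypothetical name for illustration and not an existing MMC or block
layer API. It only shows the shape: the host is claimed once, all steps
run in order, and the host is released before returning, so userspace
never holds the lock between steps.

```c
#include <stdbool.h>
#include <stddef.h>

#define RPMB_MAX_STEPS 8	/* hypothetical upper bound for this sketch */

typedef int (*rpmb_step_fn)(void *ctx);	/* returns 0 on success */

/* Hypothetical batch: an ordered list of steps submitted as one unit. */
struct rpmb_batch {
	rpmb_step_fn steps[RPMB_MAX_STEPS];
	size_t nr_steps;
};

/* Hypothetical stand-in for the host and its exclusive claim. */
struct host_model {
	bool claimed;
};

/*
 * Run every step while the host is claimed and always release the
 * host afterwards. Returns 0, or the first nonzero step error, or -1
 * if the host is busy or the batch is too long.
 */
static int rpmb_batch_execute(struct host_model *host,
			      const struct rpmb_batch *batch, void *ctx)
{
	int err = 0;
	size_t i;

	if (host->claimed || batch->nr_steps > RPMB_MAX_STEPS)
		return -1;

	host->claimed = true;
	for (i = 0; i < batch->nr_steps; i++) {
		err = batch->steps[i](ctx);
		if (err)
			break;	/* abort the rest of the sequence */
	}
	host->claimed = false;

	return err;
}

/* Example step for experimentation: counts how many steps ran. */
static int count_step(void *ctx)
{
	(*(int *)ctx)++;
	return 0;
}
```

A caller would fill in five steps for an RPMB write (two for reading the
write counter, three for the write itself, per the description quoted
above) and submit them with one rpmb_batch_execute() call.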

Yours,
Linus Walleij

^ permalink raw reply

* [PATCH 4/4] mtip32xx: use BLK_MQ_F_USE_SCHED_TAG
From: Ming Lei @ 2017-04-15 12:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Omar Sandoval, Jozef Mikovic, Ming Lei
In-Reply-To: <20170415123825.32716-1-ming.lei@redhat.com>

This patch applies the newly introduced BLK_MQ_F_SCHED_USE_HW_TAG flag
to make mq-deadline work on mtip32xx. With this flag, the hardware tag
is allocated for the scheduler, so mtip32xx works correctly.

Also, mtip32xx has a queue depth of 256, which is the same as the default
value of q->nr_requests, so in theory there should be no performance loss.

Finally, BLK_MQ_F_NO_SCHED is no longer necessary.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/mtip32xx/mtip32xx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 4e344246c8dd..203b18a9eff0 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3969,7 +3969,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 	dd->tags.reserved_tags = 1;
 	dd->tags.cmd_size = sizeof(struct mtip_cmd);
 	dd->tags.numa_node = dd->numa_node;
-	dd->tags.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED;
+	dd->tags.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SCHED_USE_HW_TAG;
 	dd->tags.driver_data = dd;
 	dd->tags.timeout = MTIP_NCQ_CMD_TIMEOUT_MS;
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH 3/4] blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG
From: Ming Lei @ 2017-04-15 12:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Omar Sandoval, Jozef Mikovic, Ming Lei
In-Reply-To: <20170415123825.32716-1-ming.lei@redhat.com>

Some drivers, for example mtip32xx, use the 'request_index'
passed to .init_request() as the hardware tag index when initializing
the hardware queue, so these drivers require that rq->tag always
equals the 'request_index' passed to .init_request().

Since the blk-mq I/O scheduler went in, the driver tag is allocated
at dispatch time, and the allocated driver tag is not guaranteed to
match the I/O scheduler's tag, so the blk-mq I/O scheduler breaks
these devices, such as mtip32xx.

This patch introduces the BLK_MQ_F_SCHED_USE_HW_TAG flag, which
allocates the hardware tag for the scheduler directly and thereby
addresses mtip32xx's issue.

In addition, this feature should make the blk-mq I/O scheduler
more efficient than the current approach if the hardware tag space is
big enough, because it saves one tag allocation/release.
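
As a rough illustration of the tag handling described above, the
following userspace model mirrors only the two hunks in blk-mq.c that
set and later restore rq->tag. The names struct rq_model, model_alloc()
and model_get_driver_tag() are hypothetical; the flag value (1 << 7) is
the one added to blk-mq.h by this patch. The regular second allocation
from hctx->tags is deliberately not modeled.

```c
#include <stdbool.h>

#define BLK_MQ_F_SCHED_USE_HW_TAG	(1u << 7)

/* Hypothetical stand-in for the two tag fields of struct request. */
struct rq_model {
	int tag;		/* driver (hardware) tag, -1 while unassigned */
	int internal_tag;	/* scheduler tag, -1 while unassigned */
};

/*
 * Allocation time: with the flag set, the hardware tag is parked in
 * internal_tag and rq->tag stays -1 until dispatch.
 */
static void model_alloc(struct rq_model *rq, unsigned int hctx_flags, int tag)
{
	if (hctx_flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
		rq->tag = -1;
		rq->internal_tag = tag;
	} else {
		rq->tag = tag;
		rq->internal_tag = -1;
	}
}

/*
 * Dispatch time: with the flag set, the parked tag is reused instead
 * of allocating a second one. Returns true if rq->tag is valid.
 */
static bool model_get_driver_tag(struct rq_model *rq, unsigned int hctx_flags)
{
	if (rq->tag != -1)
		return true;

	if (hctx_flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
		rq->tag = rq->internal_tag;
		return true;
	}

	return false;	/* the real code would allocate from hctx->tags here */
}
```

In this model the tag seen by the driver at dispatch is always the one
handed out at allocation time, which is the property mtip32xx relies on.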

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 10 +++++++++-
 block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  1 +
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9e3c0f92851b..1ff4b61135bc 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -83,7 +83,12 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
 	if (e) {
-		data->flags |= BLK_MQ_REQ_INTERNAL;
+		/*
+		 * If BLK_MQ_F_SCHED_USE_HW_TAG is set, we use hardware
+		 * tag as scheduler tag.
+		 */
+		if (!(data->hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG))
+			data->flags |= BLK_MQ_REQ_INTERNAL;
 
 		/*
 		 * Flush requests are special and go directly to the
@@ -445,6 +450,9 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	struct blk_mq_tag_set *set = q->tag_set;
 	int ret;
 
+	if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG)
+		return 0;
+
 	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
 					       set->reserved_tags);
 	if (!hctx->sched_tags)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e536dacfae4c..ac6245bdbc8c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -247,9 +247,19 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 				rq->rq_flags = RQF_MQ_INFLIGHT;
 				atomic_inc(&data->hctx->nr_active);
 			}
-			rq->tag = tag;
-			rq->internal_tag = -1;
-			data->hctx->tags->rqs[rq->tag] = rq;
+			data->hctx->tags->rqs[tag] = rq;
+
+			/*
+			 * If we use hw tag for scheduling, postpone setting
+			 * rq->tag in blk_mq_get_driver_tag().
+			 */
+			if (data->hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
+				rq->tag = -1;
+				rq->internal_tag = tag;
+			} else {
+				rq->tag = tag;
+				rq->internal_tag = -1;
+			}
 		}
 
 		blk_mq_rq_ctx_init(data->q, data->ctx, rq, op);
@@ -349,7 +359,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
+	if (sched_tag != -1 && !(hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG))
 		blk_mq_sched_completed_request(hctx, rq);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
@@ -866,6 +876,12 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 	if (rq->tag != -1)
 		goto done;
 
+	/* we buffered driver tag in rq->internal_tag */
+	if (data.hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
+		rq->tag = rq->internal_tag;
+		goto done;
+	}
+
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
 
@@ -887,9 +903,15 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 				    struct request *rq)
 {
-	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+	unsigned tag = rq->tag;
+
 	rq->tag = -1;
 
+	if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG)
+		return;
+
+	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, tag);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
 		atomic_dec(&hctx->nr_active);
@@ -2852,7 +2874,8 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 		blk_flush_plug_list(plug, false);
 
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
-	if (!blk_qc_t_is_internal(cookie))
+	if (!blk_qc_t_is_internal(cookie) || (hctx->flags &
+			BLK_MQ_F_SCHED_USE_HW_TAG))
 		rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
 	else
 		rq = blk_mq_tag_to_rq(hctx->sched_tags, blk_qc_t_to_tag(cookie));
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b90c3d5766cd..be605a05f340 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -154,6 +154,7 @@ enum {
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
+	BLK_MQ_F_SCHED_USE_HW_TAG	= 1 << 7,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH 2/4] mtip32xx: pass BLK_MQ_F_NO_SCHED
From: Ming Lei @ 2017-04-15 12:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Omar Sandoval, Jozef Mikovic, Ming Lei
In-Reply-To: <20170415123825.32716-1-ming.lei@redhat.com>

The recently introduced MQ I/O scheduler breaks mtip32xx in the
following way.

mtip32xx uses the 'request_index' passed to .init_request() as the
hardware tag index when initializing the hardware queue, and it
requires that rq->tag always equals the 'request_index'
passed to .init_request(). The current blk-mq I/O scheduler can't
guarantee this, so this patch passes BLK_MQ_F_NO_SCHED
to at least keep mtip32xx working.

This patch fixes the following strange hardware failure. The
issue can be triggered easily when doing I/O with mq-deadline
enabled.

[  186.972578] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
[  186.972578] {1}[Hardware Error]: event severity: fatal
[  186.972579] {1}[Hardware Error]:  Error 0, type: fatal
[  186.972580] {1}[Hardware Error]:   section_type: PCIe error
[  186.972580] {1}[Hardware Error]:   port_type: 0, PCIe end point
[  186.972581] {1}[Hardware Error]:   version: 1.0
[  186.972581] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[  186.972582] {1}[Hardware Error]:   device_id: 0000:07:00.0
[  186.972582] {1}[Hardware Error]:   slot: 4
[  186.972583] {1}[Hardware Error]:   secondary_bus: 0x00
[  186.972583] {1}[Hardware Error]:   vendor_id: 0x1344, device_id: 0x5150
[  186.972584] {1}[Hardware Error]:   class_code: 008001
[  186.972585] Kernel panic - not syncing: Fatal hardware error!

Reported-by: Jozef Mikovic <jmikovic@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/mtip32xx/mtip32xx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 05e3e664ea1b..4e344246c8dd 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3969,7 +3969,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 	dd->tags.reserved_tags = 1;
 	dd->tags.cmd_size = sizeof(struct mtip_cmd);
 	dd->tags.numa_node = dd->numa_node;
-	dd->tags.flags = BLK_MQ_F_SHOULD_MERGE;
+	dd->tags.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED;
 	dd->tags.driver_data = dd;
 	dd->tags.timeout = MTIP_NCQ_CMD_TIMEOUT_MS;
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH 1/4] block: respect BLK_MQ_F_NO_SCHED
From: Ming Lei @ 2017-04-15 12:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Omar Sandoval, Jozef Mikovic, Ming Lei
In-Reply-To: <20170415123825.32716-1-ming.lei@redhat.com>

If a driver declares via BLK_MQ_F_NO_SCHED that it doesn't support
I/O schedulers, we should not allow changing or showing the
available I/O schedulers.

This patch adds a check to enforce this behaviour.
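
The check can be pictured with a small userspace model. The types
struct tag_set_model and struct queue_model and the function
queue_supports_iosched() are hypothetical stand-ins; only the flag
value (1 << 6) is taken from blk-mq.h. A legacy (non-mq) queue is
never refused by this check.

```c
#include <stdbool.h>
#include <stddef.h>

#define BLK_MQ_F_NO_SCHED	(1u << 6)

/* Hypothetical stand-in for struct blk_mq_tag_set. */
struct tag_set_model {
	unsigned int flags;
};

/* Hypothetical stand-in for struct request_queue. */
struct queue_model {
	bool uses_mq;				/* models q->mq_ops != NULL */
	const struct tag_set_model *tag_set;
};

/* Only an mq queue whose tag set opts out of scheduling is refused. */
static bool queue_supports_iosched(const struct queue_model *q)
{
	if (q->uses_mq && q->tag_set &&
	    (q->tag_set->flags & BLK_MQ_F_NO_SCHED))
		return false;
	return true;
}
```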

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index dbeecf7be719..4d9084a14c10 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1098,12 +1098,20 @@ int elevator_change(struct request_queue *q, const char *name)
 }
 EXPORT_SYMBOL(elevator_change);
 
+static inline bool elv_support_iosched(struct request_queue *q)
+{
+	if (q->mq_ops && q->tag_set && (q->tag_set->flags &
+				BLK_MQ_F_NO_SCHED))
+		return false;
+	return true;
+}
+
 ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 			  size_t count)
 {
 	int ret;
 
-	if (!(q->mq_ops || q->request_fn))
+	if (!(q->mq_ops || q->request_fn) || !elv_support_iosched(q))
 		return count;
 
 	ret = __elevator_change(q, name);
@@ -1135,7 +1143,7 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
 			continue;
 		}
-		if (__e->uses_mq && q->mq_ops)
+		if (__e->uses_mq && q->mq_ops && elv_support_iosched(q))
 			len += sprintf(name+len, "%s ", __e->elevator_name);
 		else if (!__e->uses_mq && !q->mq_ops)
 			len += sprintf(name+len, "%s ", __e->elevator_name);
-- 
2.9.3

^ permalink raw reply related

* [PATCH 0/4] blk-mq-sched: allow to use hw tag for sched
From: Ming Lei @ 2017-04-15 12:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Omar Sandoval, Jozef Mikovic, Ming Lei

The 1st patch enforces BLK_MQ_F_NO_SCHED so that the available I/O
schedulers can't be changed or shown on devices which don't support
an I/O scheduler.

The 2nd patch passes BLK_MQ_F_NO_SCHED to avoid a regression
on mtip32xx which was introduced by the blk-mq I/O scheduler.

The last two patches introduce BLK_MQ_F_SCHED_USE_HW_TAG so that
the hardware tag can be used for the scheduler, and mq-deadline
can then work well on mtip32xx. Other devices with enough
hardware tag space can benefit from this feature too.

The first two patches are aimed at v4.11, and the last two are for
v4.12.

Thanks,
Ming

Ming Lei (4):
  block: respect BLK_MQ_F_NO_SCHED
  mtip32xx: pass BLK_MQ_F_NO_SCHED
  blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG
  mtip32xx: use BLK_MQ_F_USE_SCHED_TAG

 block/blk-mq-sched.c              | 10 +++++++++-
 block/blk-mq.c                    | 35 +++++++++++++++++++++++++++++------
 block/elevator.c                  | 12 ++++++++++--
 drivers/block/mtip32xx/mtip32xx.c |  2 +-
 include/linux/blk-mq.h            |  1 +
 5 files changed, 50 insertions(+), 10 deletions(-)

-- 
2.9.3

^ permalink raw reply

* Re: Outstanding MQ questions from MMC
From: Ulf Hansson @ 2017-04-15 10:20 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Arnd Bergmann, linux-mmc@vger.kernel.org, linux-block,
	Adrian Hunter, Paolo Valente, Jens Axboe, Christoph Hellwig
In-Reply-To: <CACRpkdYsWnXLZEpZr74pLB8zjTV4Q=sdAfX+sz0n_E46SaJa9Q@mail.gmail.com>

[...]

>> Alternatively, I had this idea that we could translate blk requests into
>> mmc commands and then have a (short fixed length) set of outstanding
>> mmc commands in the device that always get done in order. The card
>> detect and the user space I/O would then directly put mmc commands
>> onto the command queue, as would the blk-mq scheduler. You
>> still need a lock to access that command queue, but the mmc host
>> would just always pick the next command off the list when one
>> command completes.
>
> I looked into this.
>
> The block layer queue can wrap and handle custom device commands
> using REQ_OP_DRV_IN/OUT, and that seems to be the best way
> to play with the block layer IMO.
>
> The card detect work is a special case because it is also used by
> SDIO which does not use the block layer. But that could maybe be
> solved by a separate host lock just for the SDIO case, letting
> devices accessed as block devices use the method of inserting
> custom commands.

The problem with trying to manage the SDIO case as a special case is
that it is the same work (mmc_rescan()) that runs to detect any kind
of removable card.

Moreover, it's not until the card has been fully detected and
initialized that we can tell what kind of card it is.

Perhaps we can refactor the whole mmc_rescan() thing so there is one
part that can be run only to detect new cards being inserted in
lockless fashion, while another part could deal with the
polling/removal - which then perhaps could be different depending on
the card type.

Not sure if this helps...

>
> I looked at how e.g. IDE and SCSI does this, drivers/ide/ide-ioctls.c
> looks like this nowadays:
>
> static int generic_drive_reset(ide_drive_t *drive)
> {
>         struct request *rq;
>         int ret = 0;
>
>         rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
>         scsi_req_init(rq);
>         ide_req(rq)->type = ATA_PRIV_MISC;
>         scsi_req(rq)->cmd_len = 1;
>         scsi_req(rq)->cmd[0] = REQ_DRIVE_RESET;
>         if (blk_execute_rq(drive->queue, NULL, rq, 1))
>                 ret = rq->errors;
>         blk_put_request(rq);
>         return ret;
> }
>
> So it creates a custom REQ_OP_DRV_IN request, then scsi_req_init()
> sets up the special command, in this case
> ATA_PRIV_MISC/REQ_DRIVE_RESET and toss this into the block
> queue like everything else.
>
> We could do the same, especially the RPMB operations should
> probably have been done like this from the beginning. But due to
> historical factors they were not.
>
> It is a bit hairy and the whole thing is in a bit of flux because Christoph
> is heavily refactoring this and cleaning up the old block devices as
> we speak (I bet) so it is a bit hard to do the right thing.
>
> I easily get confused here ... for example there is custom
> per-request data access by this simple:
>
> scsi_req_init(rq)
>
> which does
>
> struct scsi_request *req = scsi_req(rq);
>
> which does
>
> static inline struct scsi_request *scsi_req(struct request *rq)
> {
>         return blk_mq_rq_to_pdu(rq);
> }
>
> Oohps blk_mq_* namespace? You would assume this means that
> you have to use blk-mq? Nah, I think not, because all it does is:
>
> static inline void *blk_mq_rq_to_pdu(struct request *rq)
> {
>         return rq + 1;
> }
>
> So while we have to #include <linux/blk-mq.h> this is one of these
> mixed semantics that just give you a pointer to something behind
> the request, a method that is simple and natural in blk-mq but which
> is (I guess) set up by some other mechanism in the !mq case,
> albeit access by this inline.
>
> And I have to do this with the old block layer to get to a point
> where we can start using blk-mq, sigh.
>
> The border between blk and blk-mq is a bit blurry right now.
>
> With blk-mq I do this:
>
> mq->tag_set.cmd_size = sizeof(foo_cmd);
> blk_mq_alloc_tag_set(...)
>
> To do this with the old blk layer I may need some help to figure
> out how to set up per-request additional data in a way that works
> with the old layer.
>
> scsi_lib.c scsi_alloc_queue() does this:
>
> q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
> if (!q)
>       return NULL;
> q->cmd_size = sizeof(foo_cmd);
>
> And this means there will be sizeof(foo_cmd) after the request
> that can be dereferenced by blk_mq_rq_to_pdu(rq);...
>
> Yeah I'll try it.
>
> Just trying to give a picture of why it's a bit in flux here.
> Or documenting it for myself :D
>
>> This also lets you integrate packed commands: if the next outstanding
>> command is the same type as the request coming in from blk-mq,
>> you can merge it into a single mmc command to be submitted
>> together, otherwise it gets deferred.
>
> Essentially the heavy lifting that needs to happen is:
>
> 1. Start allocating per-request PDU (extra data) in the MMC case
>    this will then be struct mmc_queue_req request items.
>
> 2. Turn RPMB and other ioctl() MMC operations into mmc_queue_req
>    things and funnel them into the block scheduler
>    using REQ_OP_DRV_IN/OUT requests.
>
> 3. Turn the card detect into an mmc_queue_req as well
>
> 4. We can kill the big MMC host lock for block devices and
>    split off an SDIO-only host lock.
>
> I'm onto it ... I guess.

It looks hairy, but please have a try!

In the meantime, I will try to wrap my head around this and see if we
can find a possibly easier intermediate step.
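
For reference, the per-request PDU layout discussed in the quoted text
(blk_mq_rq_to_pdu() returning rq + 1) can be modeled in userspace as
below. This is only a sketch: struct request_model, request_alloc() and
request_to_pdu() are hypothetical names, and struct foo_cmd borrows the
placeholder name used in the quote. It shows nothing more than the
"extra bytes allocated directly behind the request" idea.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct request. */
struct request_model {
	unsigned long flags;
	void *private_data;
};

/* Placeholder per-request driver data, as in the quoted text. */
struct foo_cmd {
	int opcode;
	int arg;
};

/* Allocate a zeroed request with cmd_size extra bytes right behind it. */
static struct request_model *request_alloc(size_t cmd_size)
{
	return calloc(1, sizeof(struct request_model) + cmd_size);
}

/* Same pointer arithmetic as blk_mq_rq_to_pdu(): the PDU starts at rq + 1. */
static void *request_to_pdu(struct request_model *rq)
{
	return rq + 1;
}
```

A caller would allocate with request_alloc(sizeof(struct foo_cmd)) and
then treat request_to_pdu(rq) as a struct foo_cmd pointer, freeing the
whole thing with a single free(rq).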

Kind regards
Uffe

^ permalink raw reply

* Re: [PATCH 1/3] blk-mq: unify hctx delayed_run_work and run_work
From: Bart Van Assche @ 2017-04-14 20:56 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, axboe@fb.com; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <146849ab-b865-0ba7-b434-7101e013eafb@fb.com>

On Fri, 2017-04-14 at 14:02 -0600, Jens Axboe wrote:
> I was waiting for further comments on patch 3/3.

Hello Jens,

Patch 3/3 is probably fine, but I hope you understand that the introduction
of a new race condition does not make me enthusiastic. Should your explanation
of why that race is harmless perhaps be added as a comment?

Bart.

^ permalink raw reply

* Re: [PATCH block-tree] net: off by one in inet6_pton()
From: Jens Axboe @ 2017-04-14 20:09 UTC (permalink / raw)
  To: Dan Carpenter, David S. Miller, Sagi Grimberg
  Cc: Wei Tang, Alexey Dobriyan, netdev, linux-block, kernel-janitors
In-Reply-To: <20170413194231.GD591@mwanda>

On 04/13/2017 01:42 PM, Dan Carpenter wrote:
> If "scope_len" is sizeof(scope_id) then we would put the NUL terminator
> one space beyond the end of the buffer.

Added, thanks Dan.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v4 0/5] blk-mq: Kyber multiqueue I/O scheduler
From: Jens Axboe @ 2017-04-14 20:08 UTC (permalink / raw)
  To: Omar Sandoval, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1492156558.git.osandov@fb.com>

On 04/14/2017 01:59 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This is v4 of Kyber, an I/O scheduler for multiqueue devices combining
> several techniques: the scalable bitmap library, the new blk-stats API,
> and queue depth throttling similar to blk-wbt. v1 was here [1], v2 was
> here [2], v3 was here [3].
> 
> v4 fixes a hang in v3 caused by a race condition in the wait queue
> handling in kyber_get_domain_token().
> 
> This series is based on block/for-next. Patches 1 and 2 implement a new
> sbitmap operation. Patch 3 exports a couple of helpers. Patch 4 moves a
> scheduler callback to somewhere more useful. Patch 5 implements the new
> scheduler.

Added for 4.12, thanks Omar.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 1/3] blk-mq: unify hctx delayed_run_work and run_work
From: Jens Axboe @ 2017-04-14 20:02 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <1491933638.2654.12.camel@sandisk.com>

On 04/11/2017 12:00 PM, Bart Van Assche wrote:
> On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
>>  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
>>  {
>> -	cancel_work(&hctx->run_work);
>> +	cancel_delayed_work(&hctx->run_work);
>>  	cancel_delayed_work(&hctx->delay_work);
>>  	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
>>  }
> 
> Hello Jens,
> 
> I would like to change the above cancel_*work() calls into cancel_*work_sync()
> calls because this code is used when e.g. switching between I/O schedulers and
> no .queue_rq() calls must be ongoing while switching between schedulers. Do you
> want to integrate that change into this patch or do you want me to post a
> separate patch? In the latter case, should I start from your for-next branch
> to develop that patch or from your for-next branch + this patch series?

I agree, we should make it _sync(). I'll just make the edit in the patch
when I send it out again. I was waiting for further comments on patch 3/3.

-- 
Jens Axboe

^ permalink raw reply
