* [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
@ 2017-12-19 21:05 Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 1/5] block: establish request failover callback infrastructure Mike Snitzer
` (6 more replies)
0 siblings, 7 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-19 21:05 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: hare, Bart.VanAssche, linux-block, linux-nvme, dm-devel
These patches enable DM multipath to work well on NVMe over Fabrics
devices. Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
But follow-on work will be to make it so that native NVMe multipath
and DM multipath can be made to co-exist (e.g. blacklisting certain
NVMe devices from being consumed by native NVMe multipath?)
Patch 1 updates block core to formalize a recent construct that
Christoph embedded into NVMe core (and native NVMe multipath): a
callback into a bio-based driver from the blk-mq driver's .complete
hook to blk_steal_bios() a request's bios.
Patch 2 switches NVMe over to using the block infrastructure
established by Patch 1.
Patch 3 moves nvme_req_needs_failover() from NVMe multipath to
core, which allows stacked devices (like DM multipath) to make use of
NVMe's enhanced error handling.
Patch 4 updates DM multipath to also make use of the block
infrastructure established by Patch 1.
Patch 5 can be largely ignored.. but it illustrates that Patch 1 - 4
enable DM multipath to avoid extra DM endio callbacks.
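To make the hand-off concrete, here is a rough sketch of the flow these
patches establish (taken from the patches below, trimmed and annotated;
treat it as an illustration rather than the exact code):

  /* Submission: the bio-based multipath driver tags its clone bio
   * (nvme_ns_head_make_request() in Patch 2, __map_bio_nvme() in Patch 4): */
  bio->bi_failover_rq = multipath_failover_rq;

  /* Patch 1: blk_init_request_from_bio() copies it into req->failover_rq. */

  /* Completion: NVMe core (Patches 2 and 3) invokes it from
   * nvme_complete_rq() for errors it deems retryable: */
  if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
          if (req->failover_rq && nvme_req_needs_failover(req)) {
                  req->failover_rq(req);  /* blk_steal_bios()s the bios back */
                  return;
          }
  }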
These patches have been developed on top of numerous DM changes I've
staged for 4.16, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
(which happens to include these 5 patches at the end, purely for
interim linux-next coverage purposes as these changes land in the
appropriate maintainer tree).
I've updated the "mptest" DM multipath testsuite to provide NVMe test
coverage (using NVMe fcloop), see: https://github.com/snitm/mptest
The tree I've been testing includes all of 'dm-4.16' and all but one
of the commits from 'nvme-4.16', see:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-4.16_nvme-4.16
(I've let James Smart know that commit a0b69cc8 causes "nvme connect"
to not work on my fcloop testbed).
Jens, provided review is favorable, I'd very much appreciate it if you'd
pick up patches 1 - 3 for 4.16.
Thanks,
Mike
Mike Snitzer (5):
block: establish request failover callback infrastructure
nvme: use request's failover callback for multipath failover
nvme: move nvme_req_needs_failover() from multipath to core
dm mpath: use NVMe error handling to know when an error is retryable
dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin
block/blk-core.c | 2 ++
block/blk-mq.c | 1 +
drivers/md/dm-mpath.c | 68 ++++++++++++++++++++++++++++++++++++++++++-
drivers/md/dm-rq.c | 8 ++---
drivers/md/dm.c | 7 +++--
drivers/nvme/host/core.c | 50 +++++++++++++++++++++++++++++--
drivers/nvme/host/multipath.c | 52 ++-------------------------------
drivers/nvme/host/nvme.h | 14 ---------
include/linux/blk_types.h | 4 +++
include/linux/blkdev.h | 6 ++++
include/linux/device-mapper.h | 6 ++++
11 files changed, 145 insertions(+), 73 deletions(-)
--
2.15.0
* [for-4.16 PATCH 1/5] block: establish request failover callback infrastructure
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
@ 2017-12-19 21:05 ` Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 2/5] nvme: use request's failover callback for multipath failover Mike Snitzer
` (5 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-19 21:05 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: hare, Bart.VanAssche, linux-block, linux-nvme, dm-devel
If a bio sets the 'bi_failover_rq' callback, it gets transferred to the
corresponding request's 'failover_rq' via blk_init_request_from_bio().
This callback is expected to use the blk_steal_bios() interface to
transfer a request's bios back to a bio-based request_queue.
This will be used by both NVMe multipath and DM multipath. Without it
DM multipath cannot get access to NVMe-specific error handling that NVMe
core provides in nvme_complete_rq().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
block/blk-core.c | 2 ++
block/blk-mq.c | 1 +
include/linux/blk_types.h | 4 ++++
include/linux/blkdev.h | 6 ++++++
4 files changed, 13 insertions(+)
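For illustration, a rough sketch of how a bio-based stacking driver is
expected to use this hook (all 'my_mpath' names here are hypothetical; the
real users are wired up in the following patches):

static void my_mpath_failover_rq(struct request *rq)
{
	struct my_mpath *m = my_mpath_from_bio(rq->bio); /* driver-specific lookup */
	unsigned long flags;

	/* reclaim the request's bios for resubmission down another path */
	spin_lock_irqsave(&m->lock, flags);
	blk_steal_bios(&m->queued_bios, rq);
	spin_unlock_irqrestore(&m->lock, flags);
	queue_work(my_mpath_wq, &m->requeue_work);

	/* the stolen-from request itself is now done */
	blk_mq_end_request(rq, 0);
}

static void my_mpath_map_bio(struct my_mpath *m, struct bio *clone)
{
	bio_set_dev(clone, my_mpath_current_bdev(m));
	/* inherited by the underlying blk-mq request via blk_init_request_from_bio() */
	clone->bi_failover_rq = my_mpath_failover_rq;
	generic_make_request(clone);
}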
diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..95bdc4f2b11d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1872,6 +1872,8 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
else
req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
req->write_hint = bio->bi_write_hint;
+ if (bio->bi_failover_rq)
+ req->failover_rq = bio->bi_failover_rq;
blk_rq_bio_prep(req->q, req, bio);
}
EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11097477eeab..bb52f8283f07 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -320,6 +320,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->end_io = NULL;
rq->end_io_data = NULL;
+ rq->failover_rq = NULL;
rq->next_rq = NULL;
data->ctx->rq_dispatched[op_is_sync(op)]++;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1e628e032da..c3a952991814 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -18,6 +18,9 @@ struct io_context;
struct cgroup_subsys_state;
typedef void (bio_end_io_t) (struct bio *);
+struct request;
+typedef void (rq_failover_fn)(struct request *);
+
/*
* Block error status values. See block/blk-core:blk_errors for the details.
*/
@@ -77,6 +80,7 @@ struct bio {
atomic_t __bi_remaining;
bio_end_io_t *bi_end_io;
+ rq_failover_fn *bi_failover_rq;
void *bi_private;
#ifdef CONFIG_BLK_CGROUP
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..46bcd782debe 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -237,6 +237,12 @@ struct request {
rq_end_io_fn *end_io;
void *end_io_data;
+ /*
+ * callback to failover request's bios back to upper layer
+ * bio-based queue using blk_steal_bios().
+ */
+ rq_failover_fn *failover_rq;
+
/* for bidi */
struct request *next_rq;
};
--
2.15.0
* [for-4.16 PATCH 2/5] nvme: use request's failover callback for multipath failover
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 1/5] block: establish request failover callback infrastructure Mike Snitzer
@ 2017-12-19 21:05 ` Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 3/5] nvme: move nvme_req_needs_failover() from multipath to core Mike Snitzer
` (4 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-19 21:05 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: hare, Bart.VanAssche, linux-block, linux-nvme, dm-devel
Use the request's 'failover_rq' callback, established by the previous
patch, for NVMe multipath failover. Also remove the NVMe-local
REQ_NVME_MPATH flag, since setting a bio's bi_failover_rq serves the same
purpose: it marks the request as coming in through the multipath node
(which NVMe's error handler then takes into account when deciding whether
the request is retryable).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/nvme/host/core.c | 4 ++--
drivers/nvme/host/multipath.c | 7 ++++---
drivers/nvme/host/nvme.h | 9 ---------
3 files changed, 6 insertions(+), 14 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..0c0a52fa78e4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -191,8 +191,8 @@ static inline bool nvme_req_needs_retry(struct request *req)
void nvme_complete_rq(struct request *req)
{
if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
- if (nvme_req_needs_failover(req)) {
- nvme_failover_req(req);
+ if (req->failover_rq && nvme_req_needs_failover(req)) {
+ req->failover_rq(req);
return;
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1218a9fca846..0ab51c184df9 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -19,7 +19,7 @@ module_param(multipath, bool, 0644);
MODULE_PARM_DESC(multipath,
"turn on native support for multiple controllers per subsystem");
-void nvme_failover_req(struct request *req)
+static void nvme_failover_req(struct request *req)
{
struct nvme_ns *ns = req->q->queuedata;
unsigned long flags;
@@ -35,7 +35,7 @@ void nvme_failover_req(struct request *req)
bool nvme_req_needs_failover(struct request *req)
{
- if (!(req->cmd_flags & REQ_NVME_MPATH))
+ if (!req->failover_rq)
return false;
switch (nvme_req(req)->status & 0x7ff) {
@@ -128,7 +128,8 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
ns = nvme_find_path(head);
if (likely(ns)) {
bio->bi_disk = ns->disk;
- bio->bi_opf |= REQ_NVME_MPATH;
+ /* Mark bio as coming in through the mpath node. */
+ bio->bi_failover_rq = nvme_failover_req;
ret = direct_make_request(bio);
} else if (!list_empty_careful(&head->list)) {
dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..5195b4850eb0 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -95,11 +95,6 @@ struct nvme_request {
u16 status;
};
-/*
- * Mark a bio as coming in through the mpath node.
- */
-#define REQ_NVME_MPATH REQ_DRV
-
enum {
NVME_REQ_CANCELLED = (1 << 0),
};
@@ -400,7 +395,6 @@ extern const struct attribute_group nvme_ns_id_attr_group;
extern const struct block_device_operations nvme_ns_head_ops;
#ifdef CONFIG_NVME_MULTIPATH
-void nvme_failover_req(struct request *req);
bool nvme_req_needs_failover(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
@@ -418,9 +412,6 @@ static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
}
struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
#else
-static inline void nvme_failover_req(struct request *req)
-{
-}
static inline bool nvme_req_needs_failover(struct request *req)
{
return false;
--
2.15.0
* [for-4.16 PATCH 3/5] nvme: move nvme_req_needs_failover() from multipath to core
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 1/5] block: establish request failover callback infrastructure Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 2/5] nvme: use request's failover callback for multipath failover Mike Snitzer
@ 2017-12-19 21:05 ` Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable Mike Snitzer
` (3 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-19 21:05 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: hare, Bart.VanAssche, linux-block, linux-nvme, dm-devel
nvme_req_needs_failover() is usable regardless of whether
CONFIG_NVME_MULTIPATH is set, and DM multipath will also make use of it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/nvme/host/core.c | 46 ++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/multipath.c | 47 -------------------------------------------
drivers/nvme/host/nvme.h | 5 -----
3 files changed, 46 insertions(+), 52 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0c0a52fa78e4..63691e251f8c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -188,6 +188,52 @@ static inline bool nvme_req_needs_retry(struct request *req)
return true;
}
+static bool nvme_req_needs_failover(struct request *req)
+{
+ /* Caller verifies req->failover_rq is set */
+
+ switch (nvme_req(req)->status & 0x7ff) {
+ /*
+ * Generic command status:
+ */
+ case NVME_SC_INVALID_OPCODE:
+ case NVME_SC_INVALID_FIELD:
+ case NVME_SC_INVALID_NS:
+ case NVME_SC_LBA_RANGE:
+ case NVME_SC_CAP_EXCEEDED:
+ case NVME_SC_RESERVATION_CONFLICT:
+ return false;
+
+ /*
+ * I/O command set specific error. Unfortunately these values are
+ * reused for fabrics commands, but those should never get here.
+ */
+ case NVME_SC_BAD_ATTRIBUTES:
+ case NVME_SC_INVALID_PI:
+ case NVME_SC_READ_ONLY:
+ case NVME_SC_ONCS_NOT_SUPPORTED:
+ WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+ nvme_fabrics_command);
+ return false;
+
+ /*
+ * Media and Data Integrity Errors:
+ */
+ case NVME_SC_WRITE_FAULT:
+ case NVME_SC_READ_ERROR:
+ case NVME_SC_GUARD_CHECK:
+ case NVME_SC_APPTAG_CHECK:
+ case NVME_SC_REFTAG_CHECK:
+ case NVME_SC_COMPARE_FAILED:
+ case NVME_SC_ACCESS_DENIED:
+ case NVME_SC_UNWRITTEN_BLOCK:
+ return false;
+ }
+
+ /* Everything else could be a path failure, so should be retried */
+ return true;
+}
+
void nvme_complete_rq(struct request *req)
{
if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0ab51c184df9..8df01a8d6c02 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -33,53 +33,6 @@ static void nvme_failover_req(struct request *req)
kblockd_schedule_work(&ns->head->requeue_work);
}
-bool nvme_req_needs_failover(struct request *req)
-{
- if (!req->failover_rq)
- return false;
-
- switch (nvme_req(req)->status & 0x7ff) {
- /*
- * Generic command status:
- */
- case NVME_SC_INVALID_OPCODE:
- case NVME_SC_INVALID_FIELD:
- case NVME_SC_INVALID_NS:
- case NVME_SC_LBA_RANGE:
- case NVME_SC_CAP_EXCEEDED:
- case NVME_SC_RESERVATION_CONFLICT:
- return false;
-
- /*
- * I/O command set specific error. Unfortunately these values are
- * reused for fabrics commands, but those should never get here.
- */
- case NVME_SC_BAD_ATTRIBUTES:
- case NVME_SC_INVALID_PI:
- case NVME_SC_READ_ONLY:
- case NVME_SC_ONCS_NOT_SUPPORTED:
- WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
- nvme_fabrics_command);
- return false;
-
- /*
- * Media and Data Integrity Errors:
- */
- case NVME_SC_WRITE_FAULT:
- case NVME_SC_READ_ERROR:
- case NVME_SC_GUARD_CHECK:
- case NVME_SC_APPTAG_CHECK:
- case NVME_SC_REFTAG_CHECK:
- case NVME_SC_COMPARE_FAILED:
- case NVME_SC_ACCESS_DENIED:
- case NVME_SC_UNWRITTEN_BLOCK:
- return false;
- }
-
- /* Everything else could be a path failure, so should be retried */
- return true;
-}
-
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5195b4850eb0..130eed526e1f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -395,7 +395,6 @@ extern const struct attribute_group nvme_ns_id_attr_group;
extern const struct block_device_operations nvme_ns_head_ops;
#ifdef CONFIG_NVME_MULTIPATH
-bool nvme_req_needs_failover(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_disk(struct nvme_ns_head *head);
@@ -412,10 +411,6 @@ static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
}
struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
#else
-static inline bool nvme_req_needs_failover(struct request *req)
-{
- return false;
-}
static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
}
--
2.15.0
* [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (2 preceding siblings ...)
2017-12-19 21:05 ` [for-4.16 PATCH 3/5] nvme: move nvme_req_needs_failover() from multipath to core Mike Snitzer
@ 2017-12-19 21:05 ` Mike Snitzer
2017-12-20 16:58 ` Mike Snitzer
2017-12-19 21:05 ` [for-4.16 PATCH 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin Mike Snitzer
` (2 subsequent siblings)
6 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2017-12-19 21:05 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: hare, Bart.VanAssche, linux-block, linux-nvme, dm-devel
Like NVMe's native multipath support, DM multipath's NVMe bio-based
support now allows NVMe core's error handling to requeue an NVMe blk-mq
request's bios onto DM multipath's queued_bios list for resubmission
once fail_path() occurs. multipath_failover_rq() serves as a
replacement for the traditional multipath_end_io_bio().
DM multipath's bio submission to NVMe must be done in terms that allow
the reuse of NVMe core's error handling. The following care is taken to
realize this reuse:
- NVMe core won't attempt to retry an IO if it has
REQ_FAILFAST_TRANSPORT set, so only set that flag in __map_bio().
- Set up the bio's bi_failover_rq hook to point at multipath_failover_rq, so
that NVMe blk-mq requests inherit it as their failover_rq callback
if/when NVMe core determines a request must be retried.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-mpath.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 51 insertions(+), 1 deletion(-)
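For reference, the NVMe core retry gate that the first point above refers
to looks roughly like this at the time of this series (from
drivers/nvme/host/core.c):

static inline bool nvme_req_needs_retry(struct request *req)
{
	if (blk_noretry_request(req))	/* true if any REQ_FAILFAST_* flag is set */
		return false;
	if (nvme_req(req)->status & NVME_SC_DNR)
		return false;
	if (nvme_req(req)->retries >= nvme_max_retries)
		return false;
	return true;
}

so a bio carrying REQ_FAILFAST_TRANSPORT would never reach the failover_rq
callback; hence __map_bio_nvme() leaves that flag unset.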
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3198093006e4..0ed407e150f5 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -584,9 +584,13 @@ static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
return ERR_PTR(-EAGAIN);
}
+ bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
+
return pgpath;
}
+static void multipath_failover_rq(struct request *rq);
+
static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
{
struct pgpath *pgpath;
@@ -614,6 +618,8 @@ static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
return NULL;
}
+ bio->bi_failover_rq = multipath_failover_rq;
+
return pgpath;
}
@@ -641,7 +647,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
bio->bi_status = 0;
bio_set_dev(bio, pgpath->path.dev->bdev);
- bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
@@ -1610,6 +1615,14 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
unsigned long flags;
int r = DM_ENDIO_DONE;
+ /*
+ * NVMe bio-based only needs to update path selector (on
+ * success or errors that NVMe deemed non-retryable)
+ * - retryable errors are handled by multipath_failover_rq
+ */
+ if (clone->bi_failover_rq)
+ goto done;
+
if (!*error || !retry_error(*error))
goto done;
@@ -1645,6 +1658,43 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
return r;
}
+/*
+ * multipath_failover_rq serves as a replacement for multipath_end_io_bio
+ * for all bios in a request with a retryable error.
+ */
+static void multipath_failover_rq(struct request *rq)
+{
+ struct dm_target *ti = dm_bio_get_target(rq->bio);
+ struct multipath *m = ti->private;
+ struct dm_mpath_io *mpio = get_mpio_from_bio(rq->bio);
+ struct pgpath *pgpath = mpio->pgpath;
+ unsigned long flags;
+
+ if (pgpath) {
+ struct path_selector *ps = &pgpath->pg->ps;
+
+ if (ps->type->end_io)
+ ps->type->end_io(ps, &pgpath->path, blk_rq_bytes(rq));
+
+ fail_path(pgpath);
+ }
+
+ if (atomic_read(&m->nr_valid_paths) == 0 &&
+ !test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) &&
+ !must_push_back_bio(m)) {
+ dm_report_EIO(m);
+ blk_mq_end_request(rq, BLK_STS_IOERR);
+ return;
+ }
+
+ spin_lock_irqsave(&m->lock, flags);
+ blk_steal_bios(&m->queued_bios, rq);
+ spin_unlock_irqrestore(&m->lock, flags);
+ queue_work(kmultipathd, &m->process_queued_bios);
+
+ blk_mq_end_request(rq, 0);
+}
+
/*
* Suspend can't complete until all the I/O is processed so if
* the last path fails we must error any remaining I/O.
--
2.15.0
* [for-4.16 PATCH 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (3 preceding siblings ...)
2017-12-19 21:05 ` [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable Mike Snitzer
@ 2017-12-19 21:05 ` Mike Snitzer
2017-12-22 18:02 ` [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
2017-12-26 20:51 ` Keith Busch
6 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-19 21:05 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: hare, Bart.VanAssche, linux-block, linux-nvme, dm-devel
Add a 'skip_end_io_hook' flag member to 'struct dm_target' that, if set,
instructs DM core to skip calls to the target's .end_io (or .rq_end_io) hook.
NVMe bio-based doesn't use multipath_end_io_bio() for anything other
than updating the path-selector. So it can be avoided completely if the
round-robin path selector is used (because round-robin doesn't have an
end_io hook).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-mpath.c | 24 ++++++++++++++++++++----
drivers/md/dm-rq.c | 8 ++++----
drivers/md/dm.c | 7 ++++---
include/linux/device-mapper.h | 6 ++++++
4 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 0ed407e150f5..5b4c88c1980f 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1186,6 +1186,21 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
goto bad;
}
+ /*
+ * If NVMe bio-based and all path selectors don't provide .end_io hook:
+ * inform DM core that there is no need to call this target's end_io hook.
+ */
+ if (m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
+ struct priority_group *pg;
+ if (!m->nr_priority_groups)
+ goto finish;
+ list_for_each_entry(pg, &m->priority_groups, list) {
+ if (pg->ps.type->end_io)
+ goto finish;
+ }
+ ti->skip_end_io_hook = true;
+ }
+finish:
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
@@ -1671,11 +1686,12 @@ static void multipath_failover_rq(struct request *rq)
unsigned long flags;
if (pgpath) {
- struct path_selector *ps = &pgpath->pg->ps;
-
- if (ps->type->end_io)
- ps->type->end_io(ps, &pgpath->path, blk_rq_bytes(rq));
+ if (!ti->skip_end_io_hook) {
+ struct path_selector *ps = &pgpath->pg->ps;
+ if (ps->type->end_io)
+ ps->type->end_io(ps, &pgpath->path, blk_rq_bytes(rq));
+ }
fail_path(pgpath);
}
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9d32f25489c2..64206743da8f 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -285,12 +285,12 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
{
int r = DM_ENDIO_DONE;
struct dm_rq_target_io *tio = clone->end_io_data;
- dm_request_endio_fn rq_end_io = NULL;
+ struct dm_target *ti = tio->ti;
- if (tio->ti) {
- rq_end_io = tio->ti->type->rq_end_io;
+ if (ti) {
+ dm_request_endio_fn rq_end_io = ti->type->rq_end_io;
- if (mapped && rq_end_io)
+ if (mapped && rq_end_io && !ti->skip_end_io_hook)
r = rq_end_io(tio->ti, clone, error, &tio->info);
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9f4c4a7fd40d..5c83d9dcbfe8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -937,7 +937,8 @@ static void clone_endio(struct bio *bio)
struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
struct dm_io *io = tio->io;
struct mapped_device *md = tio->io->md;
- dm_endio_fn endio = tio->ti->type->end_io;
+ struct dm_target *ti = tio->ti;
+ dm_endio_fn endio = ti->type->end_io;
if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
if (bio_op(bio) == REQ_OP_WRITE_SAME &&
@@ -948,8 +949,8 @@ static void clone_endio(struct bio *bio)
disable_write_zeroes(md);
}
- if (endio) {
- int r = endio(tio->ti, bio, &error);
+ if (endio && !ti->skip_end_io_hook) {
+ int r = endio(ti, bio, &error);
switch (r) {
case DM_ENDIO_REQUEUE:
error = BLK_STS_DM_REQUEUE;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index e46ad2ada674..1dfb75ca5d09 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -307,6 +307,12 @@ struct dm_target {
* on max_io_len boundary.
*/
bool split_discard_bios:1;
+
+ /*
+ * Set if there is no need to call this target's end_io hook
+ * (be it .end_io or .rq_end_io).
+ */
+ bool skip_end_io_hook:1;
};
/* Each target can link one of these into the table */
--
2.15.0
* Re: [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable
2017-12-19 21:05 ` [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable Mike Snitzer
@ 2017-12-20 16:58 ` Mike Snitzer
2017-12-20 20:33 ` [dm-devel] " Sagi Grimberg
0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2017-12-20 16:58 UTC (permalink / raw)
To: axboe, hch, emilne, james.smart
Cc: Bart.VanAssche, linux-block, dm-devel, linux-nvme
On Tue, Dec 19 2017 at 4:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
> Like NVMe's native multipath support, DM multipath's NVMe bio-based
> support now allows NVMe core's error handling to requeue an NVMe blk-mq
> request's bios onto DM multipath's queued_bios list for resubmission
> once fail_path() occurs. multipath_failover_rq() serves as a
> replacement for the traditional multipath_end_io_bio().
>
> DM multipath's bio submission to NVMe must be done in terms that allow
> the reuse of NVMe core's error handling. The following care is taken to
> realize this reuse:
>
> - NVMe core won't attempt to retry an IO if it has
> REQ_FAILFAST_TRANSPORT set, so only set that flag in __map_bio().
>
> - Set up the bio's bi_failover_rq hook to point at multipath_failover_rq, so
> that NVMe blk-mq requests inherit it as their failover_rq callback
> if/when NVMe core determines a request must be retried.
>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
But interestingly, with my "mptest" link failure test
(test_01_nvme_offline) I'm not actually seeing NVMe trigger a failure
that needs a multipath layer (be it NVMe multipath or DM multipath) to
fail a path and retry the IO. The pattern is that the link goes down,
and nvme waits for it to come back (internalizing any failure) and then
the IO continues.. so no multipath _really_ needed:
[55284.011286] nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
[55284.020078] nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
[55284.028872] nvme nvme2: NVME-FC{2}: controller connectivity lost. Awaiting Reconnect
[55284.037658] nvme nvme3: NVME-FC{3}: controller connectivity lost. Awaiting Reconnect
[55295.157773] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[55295.157775] nvmet: ctrl 4 keep-alive timer (15 seconds) expired!
[55295.157778] nvmet: ctrl 3 keep-alive timer (15 seconds) expired!
[55295.157780] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
[55295.157781] nvmet: ctrl 4 fatal error occurred!
[55295.157784] nvmet: ctrl 3 fatal error occurred!
[55295.157785] nvmet: ctrl 2 fatal error occurred!
[55295.199816] nvmet: ctrl 1 fatal error occurred!
[55304.047540] nvme nvme0: NVME-FC{0}: connectivity re-established. Attempting reconnect
[55304.056533] nvme nvme1: NVME-FC{1}: connectivity re-established. Attempting reconnect
[55304.066053] nvme nvme2: NVME-FC{2}: connectivity re-established. Attempting reconnect
[55304.075037] nvme nvme3: NVME-FC{3}: connectivity re-established. Attempting reconnect
[55304.373776] nvmet: creating controller 1 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[55304.373835] nvmet: creating controller 2 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[55304.373873] nvmet: creating controller 3 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[55304.373879] nvmet: creating controller 4 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[55304.430988] nvme nvme0: NVME-FC{0}: controller reconnect complete
[55304.433124] nvme nvme3: NVME-FC{3}: controller reconnect complete
[55304.433705] nvme nvme1: NVME-FC{1}: controller reconnect complete
It seems if we have multipath on top (again: either NVMe native multipath
_or_ DM multipath) we'd prefer to have the equivalent of SCSI's
REQ_FAILFAST_TRANSPORT support?
But nvme_req_needs_retry() calls blk_noretry_request() which returns
true if REQ_FAILFAST_TRANSPORT is set. Which results in
nvme_req_needs_retry() returning false. Which causes nvme_complete_rq()
to skip the multipath specific nvme_req_needs_failover(), etc.
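(For reference, blk_noretry_request() is roughly the following macro from
include/linux/blkdev.h:

#define blk_noretry_request(rq) \
	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
			    REQ_FAILFAST_DRIVER))

so any of the failfast flags short-circuits both the retry and the failover
paths.)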
So all said:
1) why wait for connection recovery if we have other connections to try?
I think NVMe needs to be plumbed for respecting REQ_FAILFAST_TRANSPORT.
2) this avoidance of NVMe retries, or failover, in response to
REQ_FAILFAST_TRANSPORT seems exactly the opposite of desired behaviour
when multipath is available?
Thoughts?
Thanks,
Mike
p.s. given NVMe FC transport isn't letting go of the IO, not failing up
to NVMe core while "awaiting reconnect".. even if "keep-alive timer (15
seconds) expired": I hacked test coverage to verify the correctness and
stability of this patch's multipath_failover_rq() hook by testing
against a dm-flakey device (up 10 sec, down 10 sec) and applying this
patch:
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 5b4c88c..d6df7010 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1685,6 +1685,8 @@ static void multipath_failover_rq(struct request *rq)
struct pgpath *pgpath = mpio->pgpath;
unsigned long flags;
+ WARN_ON_ONCE(1);
+
if (pgpath) {
if (!ti->skip_end_io_hook) {
struct path_selector *ps = &pgpath->pg->ps;
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index ad4ac29..3b8bc20 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -927,6 +927,7 @@ static int dm_table_determine_type(struct dm_table *t)
if (t->type == DM_TYPE_BIO_BASED)
return 0;
else if (t->type == DM_TYPE_NVME_BIO_BASED) {
+ return 0;
if (!dm_table_does_not_support_partial_completion(t)) {
DMERR("nvme bio-based is only possible with devices"
" that don't support partial completion");
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 592a018..da88e4c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -182,7 +182,7 @@ static inline bool nvme_req_needs_retry(struct request *req)
if (blk_noretry_request(req))
return false;
if (nvme_req(req)->status & NVME_SC_DNR)
- return false;
+ return true;
if (nvme_req(req)->retries >= nvme_max_retries)
return false;
return true;
* Re: [dm-devel] [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable
2017-12-20 16:58 ` Mike Snitzer
@ 2017-12-20 20:33 ` Sagi Grimberg
0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2017-12-20 20:33 UTC (permalink / raw)
To: Mike Snitzer, axboe, hch, emilne, james.smart
Cc: Bart.VanAssche, linux-block, dm-devel, linux-nvme
> But interestingly, with my "mptest" link failure test
> (test_01_nvme_offline) I'm not actually seeing NVMe trigger a failure
> that needs a multipath layer (be it NVMe multipath or DM multipath) to
> fail a path and retry the IO. The pattern is that the link goes down,
> and nvme waits for it to come back (internalizing any failure) and then
> the IO continues.. so no multipath _really_ needed:
>
> [55284.011286] nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
> [55284.020078] nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
> [55284.028872] nvme nvme2: NVME-FC{2}: controller connectivity lost. Awaiting Reconnect
> [55284.037658] nvme nvme3: NVME-FC{3}: controller connectivity lost. Awaiting Reconnect
> [55295.157773] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [55295.157775] nvmet: ctrl 4 keep-alive timer (15 seconds) expired!
> [55295.157778] nvmet: ctrl 3 keep-alive timer (15 seconds) expired!
> [55295.157780] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
> [55295.157781] nvmet: ctrl 4 fatal error occurred!
> [55295.157784] nvmet: ctrl 3 fatal error occurred!
> [55295.157785] nvmet: ctrl 2 fatal error occurred!
> [55295.199816] nvmet: ctrl 1 fatal error occurred!
> [55304.047540] nvme nvme0: NVME-FC{0}: connectivity re-established. Attempting reconnect
> [55304.056533] nvme nvme1: NVME-FC{1}: connectivity re-established. Attempting reconnect
> [55304.066053] nvme nvme2: NVME-FC{2}: connectivity re-established. Attempting reconnect
> [55304.075037] nvme nvme3: NVME-FC{3}: connectivity re-established. Attempting reconnect
> [55304.373776] nvmet: creating controller 1 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
> [55304.373835] nvmet: creating controller 2 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
> [55304.373873] nvmet: creating controller 3 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
> [55304.373879] nvmet: creating controller 4 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
> [55304.430988] nvme nvme0: NVME-FC{0}: controller reconnect complete
> [55304.433124] nvme nvme3: NVME-FC{3}: controller reconnect complete
> [55304.433705] nvme nvme1: NVME-FC{1}: controller reconnect complete
>
> It seems if we have multipath on top (again: either NVMe native multipath
> _or_ DM multipath) we'd prefer to have the equivalent of SCSI's
> REQ_FAILFAST_TRANSPORT support?
>
> But nvme_req_needs_retry() calls blk_noretry_request() which returns
> true if REQ_FAILFAST_TRANSPORT is set. Which results in
> nvme_req_needs_retry() returning false. Which causes nvme_complete_rq()
> to skip the multipath specific nvme_req_needs_failover(), etc.
>
> So all said:
>
> 1) why wait for connection recovery if we have other connections to try?
> I think NVMe needs to be plumbed for respecting REQ_FAILFAST_TRANSPORT.
This is specific to FC's fail-fast logic; nvme-rdma will fail inflight
commands as soon as the transport sees an error (or the keep-alive timeout
expires).
It seems that FC wants to wait for the request retries counter to be
exceeded, but given that the queue isn't unquiesced, the requests stay
quiesced until the host successfully reconnects.
* Re: [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (4 preceding siblings ...)
2017-12-19 21:05 ` [for-4.16 PATCH 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin Mike Snitzer
@ 2017-12-22 18:02 ` Mike Snitzer
2017-12-26 20:51 ` Keith Busch
6 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-22 18:02 UTC (permalink / raw)
To: hch, axboe
Cc: emilne, james.smart, Bart.VanAssche, linux-block, dm-devel, hare,
linux-nvme
On Tue, Dec 19 2017 at 4:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
> These patches enable DM multipath to work well on NVMe over Fabrics
> devices. Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
>
> But follow-on work will be to make it so that native NVMe multipath
> and DM multipath can be made to co-exist (e.g. blacklisting certain
> NVMe devices from being consumed by native NVMe multipath?)
>
> Patch 1 updates block core to formalize a recent construct that
> Christoph embedded into NVMe core (and native NVMe multipath): a
> callback into a bio-based driver from the blk-mq driver's .complete
> hook to blk_steal_bios() a request's bios.
>
> Patch 2 switches NVMe over to using the block infrastructure
> established by Patch 1.
>
> Patch 3 moves nvme_req_needs_failover() from NVMe multipath to
> core, which allows stacked devices (like DM multipath) to make use of
> NVMe's enhanced error handling.
>
> Patch 4 updates DM multipath to also make use of the block
> infrastructure established by Patch 1.
>
> Patch 5 can be largely ignored.. but it illustrates that Patch 1 - 4
> enable DM multipath to avoid extra DM endio callbacks.
>
> These patches have been developed on top of numerous DM changes I've
> staged for 4.16, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> (which happens to include these 5 patches at the end, purely for
> interim linux-next coverage purposes as these changes land in the
> appropriate maintainer tree).
>
> I've updated the "mptest" DM multipath testsuite to provide NVMe test
> coverage (using NVMe fcloop), see: https://github.com/snitm/mptest
>
> The tree I've been testing includes all of 'dm-4.16' and all but one
> of the commits from 'nvme-4.16', see:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-4.16_nvme-4.16
> (I've let James Smart know that commit a0b69cc8 causes "nvme connect"
> to not work on my fcloop testbed).
>
> Jens, provided review is favorable, I'd very much appreciate it if you'd
> pick up patches 1 - 3 for 4.16.
BTW, Christoph, if you're open to picking up patches 1 - 3 into
'nvme-4.16' that works too. I just figured that, since there is a block
core dependency, Jens would want to take them directly.
Thanks,
Mike
* Re: [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2017-12-19 21:05 [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (5 preceding siblings ...)
2017-12-22 18:02 ` [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
@ 2017-12-26 20:51 ` Keith Busch
2017-12-27 2:42 ` Mike Snitzer
6 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2017-12-26 20:51 UTC (permalink / raw)
To: Mike Snitzer
Cc: axboe, hch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
On Tue, Dec 19, 2017 at 04:05:41PM -0500, Mike Snitzer wrote:
> These patches enable DM multipath to work well on NVMe over Fabrics
> devices. Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
>
> But follow-on work will be to make it so that native NVMe multipath
> and DM multipath can be made to co-exist (e.g. blacklisting certain
> NVMe devices from being consumed by native NVMe multipath?)
Hi Mike,
I've reviewed the series and I support the goal. I'm not a big fan,
though, of having yet-another-field to set in bio and req on each IO.
Unless I'm missing something, I think we can make this simpler if you add
the new 'failover_req_fn' as an attribute of the struct request_queue
instead of threading it through bio and request. Native nvme multipath
can set the field directly in the nvme driver, and dm-mpath can set it
in each path when not using the nvme mpath. What do you think?
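Very roughly, something like this (field placement and names are only
illustrative):

	/* in struct request_queue: */
	rq_failover_fn *failover_req_fn;	/* owned by whichever multipath layer sits above */

	/* native NVMe multipath, at namespace setup: */
	ns->queue->failover_req_fn = nvme_failover_req;

	/* dm-mpath, per underlying path, when native NVMe multipath isn't used: */
	bdev_get_queue(pgpath->path.dev->bdev)->failover_req_fn = multipath_failover_rq;

	/* ...and nvme_complete_rq() would then test req->q->failover_req_fn
	 * instead of req->failover_rq. */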
* Re: [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2017-12-26 20:51 ` Keith Busch
@ 2017-12-27 2:42 ` Mike Snitzer
0 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2017-12-27 2:42 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
On Tue, Dec 26 2017 at 3:51pm -0500,
Keith Busch <keith.busch@intel.com> wrote:
> On Tue, Dec 19, 2017 at 04:05:41PM -0500, Mike Snitzer wrote:
> > These patches enable DM multipath to work well on NVMe over Fabrics
> > devices. Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
> >
> > But follow-on work will be to make it so that native NVMe multipath
> > and DM multipath can be made to co-exist (e.g. blacklisting certain
> > NVMe devices from being consumed by native NVMe multipath?)
>
> Hi Mike,
>
> I've reviewed the series and I support the goal. I'm not a big fan,
> though, of having yet-another-field to set in bio and req on each IO.
Yeah, I knew they'd be the primary sticking point for this patchset.
I'm not loving the need to carry the function pointer around either.
> Unless I'm missing something, I think we can make this simpler if you add
> the new 'failover_req_fn' as an attribute of the struct request_queue
> instead of threading it through bio and request. Native nvme multipath
> can set the field directly in the nvme driver, and dm-mpath can set it
> in each path when not using the nvme mpath. What do you think?
I initially didn't like the gotchas associated with it [1], but I worked through
them.
I'll post v2 after some testing.
Thanks,
Mike
[1]:
With DM multipath, it is easy to reliably establish the function
pointer. But clearing it on teardown is awkward.. because another DM
multipath table may have already taken a reference on the device (as
could happen when reloading the multipath table associated with the DM
multipath device). You are left with a scenario where a new table load
would set it, but teardown wouldn't easily know if it should be cleared.
And not clearing it could easily lead to dereferencing stale memory
(if/when the DM multipath driver is unloaded while the NVMe request_queue
outlives it).
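Roughly, the awkward part looks like this (purely illustrative, using the
request_queue-based variant and the field name from your suggestion):

	/* multipath_ctr(), per path: */
	q = bdev_get_queue(pgpath->path.dev->bdev);
	q->failover_req_fn = multipath_failover_rq;

	/* multipath_dtr(): a freshly loaded table (e.g. a table reload of the
	 * same mpath device) may have just set this same pointer on this same
	 * queue, so blindly doing the following can clobber it: */
	q->failover_req_fn = NULL;

	/* ...yet never clearing it leaves a pointer into dm-mpath that goes
	 * stale if the module is unloaded while the NVMe request_queue lives on. */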