* [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
@ 2017-12-27 3:22 Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 1/5] block: establish request failover callback Mike Snitzer
` (5 more replies)
0 siblings, 6 replies; 18+ messages in thread
From: Mike Snitzer @ 2017-12-27 3:22 UTC (permalink / raw)
To: axboe, hch, keith.busch
Cc: emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
This v2 follows Keith's suggestion to only add a 'failover_rq_fn' to
the request_queue (rather than both the bio and request structs).
Jens or Christoph, provided review is favorable, I'd very much
appreciate it if you'd pick up patches 1-3 for 4.16.
These patches enable DM multipath to work well on NVMe over Fabrics
devices. Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
But follow-on work will allow native NVMe multipath and DM multipath
to co-exist (e.g. by blacklisting certain NVMe devices from being
consumed by native NVMe multipath).
Patch 1 updates block core to formalize a recent construct that
Christoph embedded into NVMe core (and native NVMe multipath): a
callback, invoked from the blk-mq driver's .complete hook, that lets a
bio-based driver reclaim a request's bios via blk_steal_bios().
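As a rough illustration of the construct, the completion-path dispatch can be modeled outside the kernel with simplified stand-in types (nothing below is the real kernel API; the names merely mirror the patches):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's structures; field and type
 * names mirror the patches but these are not the real definitions. */
struct request;
typedef void (failover_rq_fn)(struct request *);

struct request_queue {
	failover_rq_fn *failover_rq_fn;	/* NULL unless an upper layer opted in */
};

struct request {
	struct request_queue *q;
	int status;	/* non-zero models a retryable NVMe error */
	int failed_over;	/* set by the failover callback */
	int completed;	/* set by normal completion */
};

/* Models the nvme_complete_rq() dispatch after Patch 2: fail over only
 * if the queue owner registered a callback and the request errored. */
static void complete_rq(struct request *req)
{
	if (req->status && req->q->failover_rq_fn) {
		req->q->failover_rq_fn(req);	/* would blk_steal_bios() upstream */
		return;
	}
	req->completed = 1;
}

/* A toy upper-layer callback, standing in for nvme_failover_req(). */
static void upper_failover(struct request *req)
{
	req->failed_over = 1;
}
```

With a callback registered the request is diverted; without one it completes normally, which is the hook DM multipath exploits in Patch 4.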
Patch 2 switches NVMe over to using the request_queue's
'failover_rq_fn' established by Patch 1.
Patch 3 moves nvme_req_needs_failover() from NVMe multipath to core,
which allows stacked devices (like DM multipath) to make use of NVMe's
enhanced error handling.
Patch 4 updates DM multipath to also make use of the request_queue's
'failover_rq_fn' established by Patch 1.
Patch 5 can be largely ignored, but it illustrates that Patches 1-4
enable DM multipath to avoid extra DM endio callbacks.
These patches have been developed on top of numerous DM changes I've
staged for 4.16, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
(which happens to include these 5 patches at the end, purely for
interim linux-next coverage purposes as these changes land in the
appropriate maintainer tree).
I've updated the "mptest" DM multipath testsuite to provide NVMe test
coverage (using NVMe fcloop), see: https://github.com/snitm/mptest
The tree I've been testing includes all of 'dm-4.16' and all but one
of the commits from 'nvme-4.16', see:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-4.16_nvme-4.16
(James Smart has since sent a fix for commit a0b69cc8)
Thanks,
Mike
Mike Snitzer (5):
block: establish request failover callback
nvme: use request_queue's failover_rq_fn callback for multipath failover
nvme: move nvme_req_needs_failover() from multipath to core
dm mpath: use NVMe error handling to know when an error is retryable
dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin
drivers/md/dm-mpath.c | 87 +++++++++++++++++++++++++++++++++++++++++--
drivers/md/dm-rq.c | 8 ++--
drivers/md/dm-table.c | 2 +
drivers/md/dm.c | 7 ++--
drivers/nvme/host/core.c | 51 ++++++++++++++++++++++++-
drivers/nvme/host/multipath.c | 48 ------------------------
drivers/nvme/host/nvme.h | 10 -----
include/linux/blkdev.h | 6 +++
include/linux/device-mapper.h | 9 +++++
9 files changed, 158 insertions(+), 70 deletions(-)
--
2.15.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [for-4.16 PATCH v2 1/5] block: establish request failover callback
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
@ 2017-12-27 3:22 ` Mike Snitzer
2017-12-29 10:10 ` Christoph Hellwig
2017-12-27 3:22 ` [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover Mike Snitzer
` (4 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2017-12-27 3:22 UTC (permalink / raw)
To: axboe, hch, keith.busch
Cc: emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
All requests allocated from a request_queue with this callback set can
be failed over during completion.
This callback is expected to use the blk_steal_bios() interface to
transfer a request's bios back to an upper-layer bio-based
request_queue.
This will be used by both NVMe multipath and DM multipath. Without it
DM multipath cannot get access to NVMe-specific error handling that NVMe
core provides in nvme_complete_rq().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
include/linux/blkdev.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..f45f5925e100 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -278,6 +278,7 @@ typedef int (lld_busy_fn) (struct request_queue *q);
typedef int (bsg_job_fn) (struct bsg_job *);
typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
typedef void (exit_rq_fn)(struct request_queue *, struct request *);
+typedef void (failover_rq_fn)(struct request *);
enum blk_eh_timer_return {
BLK_EH_NOT_HANDLED,
@@ -423,6 +424,11 @@ struct request_queue {
exit_rq_fn *exit_rq_fn;
/* Called from inside blk_get_request() */
void (*initialize_rq_fn)(struct request *rq);
+ /*
+ * Callback to failover request's bios back to upper layer
+ * bio-based request_queue using blk_steal_bios().
+ */
+ failover_rq_fn *failover_rq_fn;
const struct blk_mq_ops *mq_ops;
--
2.15.0
* [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 1/5] block: establish request failover callback Mike Snitzer
@ 2017-12-27 3:22 ` Mike Snitzer
2017-12-29 10:11 ` Christoph Hellwig
2017-12-27 3:22 ` [for-4.16 PATCH v2 3/5] nvme: move nvme_req_needs_failover() from multipath to core Mike Snitzer
` (3 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2017-12-27 3:22 UTC (permalink / raw)
To: axboe, hch, keith.busch
Cc: emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
Also, remove NVMe-local REQ_NVME_MPATH since setting the request_queue's
failover_rq_fn serves the same purpose. NVMe's error handling only
considers failing over a request if its request_queue has
failover_rq_fn set.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/nvme/host/core.c | 5 +++--
drivers/nvme/host/multipath.c | 4 +---
drivers/nvme/host/nvme.h | 5 -----
3 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..99f1c7d8514e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -191,8 +191,8 @@ static inline bool nvme_req_needs_retry(struct request *req)
void nvme_complete_rq(struct request *req)
{
if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
- if (nvme_req_needs_failover(req)) {
- nvme_failover_req(req);
+ if (req->q->failover_rq_fn && nvme_req_needs_failover(req)) {
+ req->q->failover_rq_fn(req);
return;
}
@@ -2893,6 +2893,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
ctrl->cntlid, ns->head->instance);
flags = GENHD_FL_HIDDEN;
+ ns->queue->failover_rq_fn = nvme_failover_req;
} else {
sprintf(disk_name, "nvme%dn%d", ctrl->subsys->instance,
ns->head->instance);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1218a9fca846..b576942b96b2 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -35,8 +35,7 @@ void nvme_failover_req(struct request *req)
bool nvme_req_needs_failover(struct request *req)
{
- if (!(req->cmd_flags & REQ_NVME_MPATH))
- return false;
+ /* Caller verifies req->q->failover_rq_fn is set */
switch (nvme_req(req)->status & 0x7ff) {
/*
@@ -128,7 +127,6 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
ns = nvme_find_path(head);
if (likely(ns)) {
bio->bi_disk = ns->disk;
- bio->bi_opf |= REQ_NVME_MPATH;
ret = direct_make_request(bio);
} else if (!list_empty_careful(&head->list)) {
dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..5ec44935690f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -95,11 +95,6 @@ struct nvme_request {
u16 status;
};
-/*
- * Mark a bio as coming in through the mpath node.
- */
-#define REQ_NVME_MPATH REQ_DRV
-
enum {
NVME_REQ_CANCELLED = (1 << 0),
};
--
2.15.0
* [for-4.16 PATCH v2 3/5] nvme: move nvme_req_needs_failover() from multipath to core
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 1/5] block: establish request failover callback Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover Mike Snitzer
@ 2017-12-27 3:22 ` Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 4/5] dm mpath: use NVMe error handling to know when an error is retryable Mike Snitzer
` (2 subsequent siblings)
5 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2017-12-27 3:22 UTC (permalink / raw)
To: axboe, hch, keith.busch
Cc: emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
nvme_req_needs_failover() could be used regardless of whether
CONFIG_NVME_MULTIPATH is set.
DM multipath will also make use of it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/nvme/host/core.c | 46 +++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/multipath.c | 46 -------------------------------------------
drivers/nvme/host/nvme.h | 5 -----
3 files changed, 46 insertions(+), 51 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99f1c7d8514e..381d19c9e431 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -188,6 +188,52 @@ static inline bool nvme_req_needs_retry(struct request *req)
return true;
}
+static bool nvme_req_needs_failover(struct request *req)
+{
+ /* Caller verifies req->q->failover_rq_fn is set */
+
+ switch (nvme_req(req)->status & 0x7ff) {
+ /*
+ * Generic command status:
+ */
+ case NVME_SC_INVALID_OPCODE:
+ case NVME_SC_INVALID_FIELD:
+ case NVME_SC_INVALID_NS:
+ case NVME_SC_LBA_RANGE:
+ case NVME_SC_CAP_EXCEEDED:
+ case NVME_SC_RESERVATION_CONFLICT:
+ return false;
+
+ /*
+ * I/O command set specific error. Unfortunately these values are
+ * reused for fabrics commands, but those should never get here.
+ */
+ case NVME_SC_BAD_ATTRIBUTES:
+ case NVME_SC_INVALID_PI:
+ case NVME_SC_READ_ONLY:
+ case NVME_SC_ONCS_NOT_SUPPORTED:
+ WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+ nvme_fabrics_command);
+ return false;
+
+ /*
+ * Media and Data Integrity Errors:
+ */
+ case NVME_SC_WRITE_FAULT:
+ case NVME_SC_READ_ERROR:
+ case NVME_SC_GUARD_CHECK:
+ case NVME_SC_APPTAG_CHECK:
+ case NVME_SC_REFTAG_CHECK:
+ case NVME_SC_COMPARE_FAILED:
+ case NVME_SC_ACCESS_DENIED:
+ case NVME_SC_UNWRITTEN_BLOCK:
+ return false;
+ }
+
+ /* Everything else could be a path failure, so should be retried */
+ return true;
+}
+
void nvme_complete_rq(struct request *req)
{
if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index b576942b96b2..9cc6b80f3530 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -33,52 +33,6 @@ void nvme_failover_req(struct request *req)
kblockd_schedule_work(&ns->head->requeue_work);
}
-bool nvme_req_needs_failover(struct request *req)
-{
- /* Caller verifies req->q->failover_rq_fn is set */
-
- switch (nvme_req(req)->status & 0x7ff) {
- /*
- * Generic command status:
- */
- case NVME_SC_INVALID_OPCODE:
- case NVME_SC_INVALID_FIELD:
- case NVME_SC_INVALID_NS:
- case NVME_SC_LBA_RANGE:
- case NVME_SC_CAP_EXCEEDED:
- case NVME_SC_RESERVATION_CONFLICT:
- return false;
-
- /*
- * I/O command set specific error. Unfortunately these values are
- * reused for fabrics commands, but those should never get here.
- */
- case NVME_SC_BAD_ATTRIBUTES:
- case NVME_SC_INVALID_PI:
- case NVME_SC_READ_ONLY:
- case NVME_SC_ONCS_NOT_SUPPORTED:
- WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
- nvme_fabrics_command);
- return false;
-
- /*
- * Media and Data Integrity Errors:
- */
- case NVME_SC_WRITE_FAULT:
- case NVME_SC_READ_ERROR:
- case NVME_SC_GUARD_CHECK:
- case NVME_SC_APPTAG_CHECK:
- case NVME_SC_REFTAG_CHECK:
- case NVME_SC_COMPARE_FAILED:
- case NVME_SC_ACCESS_DENIED:
- case NVME_SC_UNWRITTEN_BLOCK:
- return false;
- }
-
- /* Everything else could be a path failure, so should be retried */
- return true;
-}
-
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ec44935690f..f31b496c863f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -396,7 +396,6 @@ extern const struct block_device_operations nvme_ns_head_ops;
#ifdef CONFIG_NVME_MULTIPATH
void nvme_failover_req(struct request *req);
-bool nvme_req_needs_failover(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_disk(struct nvme_ns_head *head);
@@ -416,10 +415,6 @@ struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
static inline void nvme_failover_req(struct request *req)
{
}
-static inline bool nvme_req_needs_failover(struct request *req)
-{
- return false;
-}
static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
}
--
2.15.0
* [for-4.16 PATCH v2 4/5] dm mpath: use NVMe error handling to know when an error is retryable
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (2 preceding siblings ...)
2017-12-27 3:22 ` [for-4.16 PATCH v2 3/5] nvme: move nvme_req_needs_failover() from multipath to core Mike Snitzer
@ 2017-12-27 3:22 ` Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin Mike Snitzer
2018-01-02 23:29 ` [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Keith Busch
5 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2017-12-27 3:22 UTC (permalink / raw)
To: axboe, hch, keith.busch
Cc: emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
Like NVMe's native multipath support, DM multipath's NVMe bio-based
support now allows NVMe core's error handling to requeue an NVMe blk-mq
request's bios onto DM multipath's queued_bios list for resubmission
once fail_path() occurs. multipath_failover_rq() serves as a
replacement for the traditional multipath_end_io_bio().
DM multipath's bio submission to NVMe must be done in terms that allow
the reuse of NVMe core's error handling. The following care is taken to
realize this reuse:
- NVMe core won't attempt to retry an IO if it has
REQ_FAILFAST_TRANSPORT set, so only set it in __map_bio().
- Set up the underlying request_queue's 'failover_rq_fn' callback to
use multipath_failover_rq, so that NVMe blk-mq requests use it
if/when NVMe core determines a request must be retried.
(a new target_type 'cleanup_device' hook is established to properly
reset each underlying request_queue's 'failover_rq_fn' on final
teardown of the multipath device)
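The failover flow described above (fail the path, steal the request's bios onto queued_bios, kick the worker) can be sketched as a minimal userspace model; the singly linked list below stands in for blk_steal_bios() and the bio_list machinery, and locking and path-selector bookkeeping are omitted:

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *next; };

/* Stand-in for the request: a chain of bios awaiting completion. */
struct request { struct bio *bio; };

/* Stand-in for struct multipath's requeue state. */
struct multipath {
	struct bio *queued_bios;	/* bios waiting for resubmission */
	int work_queued;	/* models queue_work(kmultipathd, ...) */
	int valid_paths;
};

/* Models blk_steal_bios(): move every bio off the request onto the list. */
static void steal_bios(struct multipath *m, struct request *rq)
{
	while (rq->bio) {
		struct bio *b = rq->bio;

		rq->bio = b->next;
		b->next = m->queued_bios;
		m->queued_bios = b;
	}
}

/* Models multipath_failover_rq(): error out if no paths remain,
 * otherwise requeue the bios and complete the request cleanly. */
static int failover_rq(struct multipath *m, struct request *rq)
{
	if (m->valid_paths == 0)
		return -1;	/* models dm_report_EIO + BLK_STS_IOERR */
	steal_bios(m, rq);
	m->work_queued = 1;	/* models queue_work(kmultipathd, ...) */
	return 0;	/* request itself ends with status 0 */
}
```

The key property, as in the patch, is that the blk-mq request completes successfully while its bios live on for resubmission down another path.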
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-mpath.c | 71 +++++++++++++++++++++++++++++++++++++++++--
drivers/md/dm-table.c | 2 ++
include/linux/device-mapper.h | 3 ++
3 files changed, 73 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3198093006e4..875df8ad6efe 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -584,6 +584,8 @@ static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
return ERR_PTR(-EAGAIN);
}
+ bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
+
return pgpath;
}
@@ -641,7 +643,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
bio->bi_status = 0;
bio_set_dev(bio, pgpath->path.dev->bdev);
- bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
@@ -855,6 +856,8 @@ static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **
return 0;
}
+static void multipath_failover_rq(struct request *rq);
+
static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
struct dm_target *ti)
{
@@ -879,7 +882,10 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
goto bad;
}
- if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+ if (m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
+ struct request_queue *q = bdev_get_queue(p->path.dev->bdev);
+ q->failover_rq_fn = multipath_failover_rq;
+ } else {
INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
if (setup_scsi_dh(p->path.dev->bdev, m, &ti->error)) {
dm_put_device(ti, p->path.dev);
@@ -1610,6 +1616,14 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
unsigned long flags;
int r = DM_ENDIO_DONE;
+ /*
+ * NVMe bio-based only needs to update path selector (on
+ * success or errors that NVMe deemed non-retryable)
+ * - retryable errors are handled by multipath_failover_rq
+ */
+ if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+ goto done;
+
if (!*error || !retry_error(*error))
goto done;
@@ -1645,6 +1659,43 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
return r;
}
+/*
+ * multipath_failover_rq serves as a replacement for multipath_end_io_bio
+ * for all bios in a request with a retryable error.
+ */
+static void multipath_failover_rq(struct request *rq)
+{
+ struct dm_target *ti = dm_bio_get_target(rq->bio);
+ struct multipath *m = ti->private;
+ struct dm_mpath_io *mpio = get_mpio_from_bio(rq->bio);
+ struct pgpath *pgpath = mpio->pgpath;
+ unsigned long flags;
+
+ if (pgpath) {
+ struct path_selector *ps = &pgpath->pg->ps;
+
+ if (ps->type->end_io)
+ ps->type->end_io(ps, &pgpath->path, blk_rq_bytes(rq));
+
+ fail_path(pgpath);
+ }
+
+ if (atomic_read(&m->nr_valid_paths) == 0 &&
+ !test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) &&
+ !must_push_back_bio(m)) {
+ dm_report_EIO(m);
+ blk_mq_end_request(rq, BLK_STS_IOERR);
+ return;
+ }
+
+ spin_lock_irqsave(&m->lock, flags);
+ blk_steal_bios(&m->queued_bios, rq);
+ spin_unlock_irqrestore(&m->lock, flags);
+ queue_work(kmultipathd, &m->process_queued_bios);
+
+ blk_mq_end_request(rq, 0);
+}
+
/*
* Suspend can't complete until all the I/O is processed so if
* the last path fails we must error any remaining I/O.
@@ -2029,12 +2080,25 @@ static int multipath_busy(struct dm_target *ti)
return busy;
}
+static void multipath_cleanup_device(struct dm_target *ti, struct dm_dev *dev)
+{
+ struct multipath *m = ti->private;
+ struct request_queue *q;
+
+ if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
+ return;
+
+ q = bdev_get_queue(dev->bdev);
+ if (q)
+ q->failover_rq_fn = NULL;
+}
+
/*-----------------------------------------------------------------
* Module setup
*---------------------------------------------------------------*/
static struct target_type multipath_target = {
.name = "multipath",
- .version = {1, 12, 0},
+ .version = {1, 13, 0},
.features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
.module = THIS_MODULE,
.ctr = multipath_ctr,
@@ -2052,6 +2116,7 @@ static struct target_type multipath_target = {
.prepare_ioctl = multipath_prepare_ioctl,
.iterate_devices = multipath_iterate_devices,
.busy = multipath_busy,
+ .cleanup_device = multipath_cleanup_device,
};
static int __init dm_multipath_init(void)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index ad4ac294dd57..86d7530384c3 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -517,6 +517,8 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
return;
}
if (refcount_dec_and_test(&dd->count)) {
+ if (ti->type->cleanup_device)
+ ti->type->cleanup_device(ti, d);
dm_put_table_device(ti->table->md, d);
list_del(&dd->list);
kfree(dd);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index e46ad2ada674..758feae899f9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -92,6 +92,8 @@ typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
typedef int (*dm_prepare_ioctl_fn) (struct dm_target *ti,
struct block_device **bdev, fmode_t *mode);
+typedef void (*dm_cleanup_device_fn) (struct dm_target *ti, struct dm_dev *dev);
+
/*
* These iteration functions are typically used to check (and combine)
* properties of underlying devices.
@@ -181,6 +183,7 @@ struct target_type {
dm_message_fn message;
dm_prepare_ioctl_fn prepare_ioctl;
dm_busy_fn busy;
+ dm_cleanup_device_fn cleanup_device;
dm_iterate_devices_fn iterate_devices;
dm_io_hints_fn io_hints;
dm_dax_direct_access_fn direct_access;
--
2.15.0
* [for-4.16 PATCH v2 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (3 preceding siblings ...)
2017-12-27 3:22 ` [for-4.16 PATCH v2 4/5] dm mpath: use NVMe error handling to know when an error is retryable Mike Snitzer
@ 2017-12-27 3:22 ` Mike Snitzer
2018-01-02 23:29 ` [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Keith Busch
5 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2017-12-27 3:22 UTC (permalink / raw)
To: axboe, hch, keith.busch
Cc: emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
Add a 'skip_end_io_hook' flag member to 'struct dm_target' that, if set,
instructs calls to .end_io (or .rq_end_io) to be skipped.
NVMe bio-based doesn't use multipath_end_io_bio() for anything other
than updating the path-selector. So it can be avoided completely if the
round-robin path selector is used (because round-robin doesn't have an
end_io hook).
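The constructor-time check this patch adds reduces to a small predicate: scan every priority group's path selector and allow skipping the end_io hook only when none of them provides one. The types below are simplified stand-ins for the DM structures, not the real definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for DM's path-selector types. */
struct path_selector_type {
	void (*end_io)(void);	/* NULL for round-robin */
};

struct priority_group {
	struct path_selector_type *ps_type;
};

/* Dummy end_io, standing in for a selector that needs per-IO
 * accounting (unlike round-robin). */
static void dummy_end_io(void) { }

/* Models the multipath_ctr() check from the patch: the end_io hook may
 * be skipped only if at least one priority group exists and every
 * configured path selector lacks an end_io hook. */
static int can_skip_end_io(const struct priority_group *pgs, int nr)
{
	int i;

	if (nr == 0)
		return 0;
	for (i = 0; i < nr; i++)
		if (pgs[i].ps_type->end_io)
			return 0;
	return 1;
}
```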
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-mpath.c | 24 ++++++++++++++++++++----
drivers/md/dm-rq.c | 8 ++++----
drivers/md/dm.c | 7 ++++---
include/linux/device-mapper.h | 6 ++++++
4 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 875df8ad6efe..fc0f1940b107 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1187,6 +1187,21 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
goto bad;
}
+ /*
+ * If NVMe bio-based and all path selectors don't provide .end_io hook:
+ * inform DM core that there is no need to call this target's end_io hook.
+ */
+ if (m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
+ struct priority_group *pg;
+ if (!m->nr_priority_groups)
+ goto finish;
+ list_for_each_entry(pg, &m->priority_groups, list) {
+ if (pg->ps.type->end_io)
+ goto finish;
+ }
+ ti->skip_end_io_hook = true;
+ }
+finish:
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
@@ -1672,11 +1687,12 @@ static void multipath_failover_rq(struct request *rq)
unsigned long flags;
if (pgpath) {
- struct path_selector *ps = &pgpath->pg->ps;
-
- if (ps->type->end_io)
- ps->type->end_io(ps, &pgpath->path, blk_rq_bytes(rq));
+ if (!ti->skip_end_io_hook) {
+ struct path_selector *ps = &pgpath->pg->ps;
+ if (ps->type->end_io)
+ ps->type->end_io(ps, &pgpath->path, blk_rq_bytes(rq));
+ }
fail_path(pgpath);
}
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9d32f25489c2..64206743da8f 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -285,12 +285,12 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
{
int r = DM_ENDIO_DONE;
struct dm_rq_target_io *tio = clone->end_io_data;
- dm_request_endio_fn rq_end_io = NULL;
+ struct dm_target *ti = tio->ti;
- if (tio->ti) {
- rq_end_io = tio->ti->type->rq_end_io;
+ if (ti) {
+ dm_request_endio_fn rq_end_io = ti->type->rq_end_io;
- if (mapped && rq_end_io)
+ if (mapped && rq_end_io && !ti->skip_end_io_hook)
r = rq_end_io(tio->ti, clone, error, &tio->info);
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9f4c4a7fd40d..5c83d9dcbfe8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -937,7 +937,8 @@ static void clone_endio(struct bio *bio)
struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
struct dm_io *io = tio->io;
struct mapped_device *md = tio->io->md;
- dm_endio_fn endio = tio->ti->type->end_io;
+ struct dm_target *ti = tio->ti;
+ dm_endio_fn endio = ti->type->end_io;
if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
if (bio_op(bio) == REQ_OP_WRITE_SAME &&
@@ -948,8 +949,8 @@ static void clone_endio(struct bio *bio)
disable_write_zeroes(md);
}
- if (endio) {
- int r = endio(tio->ti, bio, &error);
+ if (endio && !ti->skip_end_io_hook) {
+ int r = endio(ti, bio, &error);
switch (r) {
case DM_ENDIO_REQUEUE:
error = BLK_STS_DM_REQUEUE;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 758feae899f9..f3c4bd29083f 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -310,6 +310,12 @@ struct dm_target {
* on max_io_len boundary.
*/
bool split_discard_bios:1;
+
+ /*
+ * Set if there is no need to call this target's end_io hook
+ * (be it .end_io or .rq_end_io).
+ */
+ bool skip_end_io_hook:1;
};
/* Each target can link one of these into the table */
--
2.15.0
* Re: [for-4.16 PATCH v2 1/5] block: establish request failover callback
2017-12-27 3:22 ` [for-4.16 PATCH v2 1/5] block: establish request failover callback Mike Snitzer
@ 2017-12-29 10:10 ` Christoph Hellwig
2017-12-29 20:19 ` Mike Snitzer
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-12-29 10:10 UTC (permalink / raw)
To: Mike Snitzer
Cc: axboe, hch, keith.busch, emilne, james.smart, hare,
Bart.VanAssche, linux-block, linux-nvme, dm-devel
On Tue, Dec 26, 2017 at 10:22:53PM -0500, Mike Snitzer wrote:
> All requests allocated from a request_queue with this callback set can
> be failed over during completion.
>
> This callback is expected to use the blk_steal_bios() interface to
> transfer a request's bios back to an upper-layer bio-based
> request_queue.
>
> This will be used by both NVMe multipath and DM multipath. Without it
> DM multipath cannot get access to NVMe-specific error handling that NVMe
> core provides in nvme_complete_rq().
And the whole point is that it should not get any such access.
The reason why we did nvme multipathing differently is because the
design of dm-multipath inflicts so much pain on users that we absolutely
want to avoid it this time around.
* Re: [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover
2017-12-27 3:22 ` [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover Mike Snitzer
@ 2017-12-29 10:11 ` Christoph Hellwig
2017-12-29 20:22 ` Mike Snitzer
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-12-29 10:11 UTC (permalink / raw)
To: Mike Snitzer
Cc: axboe, hch, keith.busch, emilne, james.smart, hare,
Bart.VanAssche, linux-block, linux-nvme, dm-devel
Independent of any policy question as raised in reply to the previous
patch: until the ANA (ALUA equivalent for NVMe) spec has been finalized
and our code for it has landed, there is no way we're going to expose
anything from this code base, as it will be highly in flux.
* Re: [for-4.16 PATCH v2 1/5] block: establish request failover callback
2017-12-29 10:10 ` Christoph Hellwig
@ 2017-12-29 20:19 ` Mike Snitzer
2018-01-04 10:28 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2017-12-29 20:19 UTC (permalink / raw)
To: Christoph Hellwig, axboe
Cc: keith.busch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
On Fri, Dec 29 2017 at 5:10am -0500,
Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Dec 26, 2017 at 10:22:53PM -0500, Mike Snitzer wrote:
> > All requests allocated from a request_queue with this callback set can
> > be failed over during completion.
> >
> > This callback is expected to use the blk_steal_bios() interface to
> > transfer a request's bios back to an upper-layer bio-based
> > request_queue.
> >
> > This will be used by both NVMe multipath and DM multipath. Without it
> > DM multipath cannot get access to NVMe-specific error handling that NVMe
> > core provides in nvme_complete_rq().
>
> And the whole point is that it should not get any such access.
No, the whole point is you hijacked multipathing for little to no gain.
> The reason why we did nvme multipathing differently is because the
> design of dm-multipath inflicts so much pain on users that we absolutely
> want to avoid it this time around.
Is that the royal "we"?
_You_ are the one subjecting users to pain. There is no reason users
should need to have multiple management domains for multipathing unless
they opt-in. Linux _is_ about choice, yet you're working overtime to
limit that choice.
You are blatantly ignoring/rejecting both Hannes [1] and me. Your
attempt to impose _how_ NVMe multipathing must be done is unacceptable.
Hopefully Jens can see through your senseless position and will accept
patches 1-3 for 4.16. They offer very minimal change that enables
users to decide which multipathing they'd prefer to use with NVMe.
Just wish you could stop with this petty bullshit and actually
collaborate with people.
I've shown how easy it is to enable NVMe multipathing in terms of DM
multipath (yet preserve your native NVMe multipathing). Please stop
being so dogmatic. Are you scared of being proven wrong about what the
market wants?
If you'd allow progress toward native NVMe and DM multipathing
coexisting we'd let the users decide what they prefer. I don't need to
impose one way or the other, but I _do_ need to preserve DM multipath
compatibility given the extensive use of DM multipath in the enterprise
and increased tooling that builds upon it.
[1] http://lists.infradead.org/pipermail/linux-nvme/2017-October/013719.html
* Re: [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover
2017-12-29 10:11 ` Christoph Hellwig
@ 2017-12-29 20:22 ` Mike Snitzer
0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2017-12-29 20:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, keith.busch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
On Fri, Dec 29 2017 at 5:11am -0500,
Christoph Hellwig <hch@lst.de> wrote:
> Independent on any policy question as raised in reply to the previous
> patch: until the ANA (ALUA equivalent for NVMe) spec has been finalized
> and our code for it has landed there is no way we're going to expose
> anything from this code based as it will be highly in flux.
Please stop hiding behind specs like they are governing this aspect of
Linux's NVMe implementation.
Please just own the need for NVMe to provide DM multipathing
compatibility.
Thanks.
* Re: [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
` (4 preceding siblings ...)
2017-12-27 3:22 ` [for-4.16 PATCH v2 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin Mike Snitzer
@ 2018-01-02 23:29 ` Keith Busch
2018-01-03 0:24 ` Mike Snitzer
` (2 more replies)
5 siblings, 3 replies; 18+ messages in thread
From: Keith Busch @ 2018-01-02 23:29 UTC (permalink / raw)
To: Mike Snitzer
Cc: axboe, hch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
Instead of hiding NVMe path related errors, the NVMe driver needs to
encode an appropriate generic block status from an NVMe status.
We already do this translation whether or not CONFIG_NVME_MULTIPATH is
set, so I think it's silly that NVMe native multipathing has a second status
decoder. This just doubles the work if we need to handle any new NVMe
status codes in the future.
I have a counter-proposal below that unifies NVMe-to-block status
translations, and combines common code for determining if an error is a
path failure. This should work for both NVMe and DM, and DM won't need
NVMe specifics.
I can split this into a series if there's indication this is ok and
satisfies the need.
---
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f7810cc869ac..f6f7a1aefc53 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1475,21 +1475,6 @@ static void activate_path_work(struct work_struct *work)
activate_or_offline_path(pgpath);
}
-static int noretry_error(blk_status_t error)
-{
- switch (error) {
- case BLK_STS_NOTSUPP:
- case BLK_STS_NOSPC:
- case BLK_STS_TARGET:
- case BLK_STS_NEXUS:
- case BLK_STS_MEDIUM:
- return 1;
- }
-
- /* Anything else could be a path failure, so should be retried */
- return 0;
-}
-
static int multipath_end_io(struct dm_target *ti, struct request *clone,
blk_status_t error, union map_info *map_context)
{
@@ -1508,7 +1493,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
* request into dm core, which will remake a clone request and
* clone bios for it and resubmit it later.
*/
- if (error && !noretry_error(error)) {
+ if (error && blk_path_failure(error)) {
struct multipath *m = ti->private;
r = DM_ENDIO_REQUEUE;
@@ -1544,7 +1529,7 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
unsigned long flags;
int r = DM_ENDIO_DONE;
- if (!*error || noretry_error(*error))
+ if (!*error || !blk_path_failure(*error))
goto done;
if (pgpath)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..25ef349fd4e4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -190,20 +190,18 @@ static inline bool nvme_req_needs_retry(struct request *req)
void nvme_complete_rq(struct request *req)
{
- if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
- if (nvme_req_needs_failover(req)) {
- nvme_failover_req(req);
- return;
- }
+ blk_status_t status = nvme_error_status(req);
+ if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
+ if (nvme_failover_req(req, status))
+ return;
if (!blk_queue_dying(req->q)) {
nvme_req(req)->retries++;
blk_mq_requeue_request(req, true);
return;
}
}
-
- blk_mq_end_request(req, nvme_error_status(req));
+ blk_mq_end_request(req, status);
}
EXPORT_SYMBOL_GPL(nvme_complete_rq);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1218a9fca846..fa3d96780258 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -19,11 +19,16 @@ module_param(multipath, bool, 0644);
MODULE_PARM_DESC(multipath,
"turn on native support for multiple controllers per subsystem");
-void nvme_failover_req(struct request *req)
+bool nvme_failover_req(struct request *req, blk_status_t status)
{
struct nvme_ns *ns = req->q->queuedata;
unsigned long flags;
+ if (!(req->cmd_flags & REQ_NVME_MPATH))
+ return false;
+ if (!blk_path_failure(status))
+ return false;
+
spin_lock_irqsave(&ns->head->requeue_lock, flags);
blk_steal_bios(&ns->head->requeue_list, req);
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
@@ -31,52 +36,6 @@ void nvme_failover_req(struct request *req)
nvme_reset_ctrl(ns->ctrl);
kblockd_schedule_work(&ns->head->requeue_work);
-}
-
-bool nvme_req_needs_failover(struct request *req)
-{
- if (!(req->cmd_flags & REQ_NVME_MPATH))
- return false;
-
- switch (nvme_req(req)->status & 0x7ff) {
- /*
- * Generic command status:
- */
- case NVME_SC_INVALID_OPCODE:
- case NVME_SC_INVALID_FIELD:
- case NVME_SC_INVALID_NS:
- case NVME_SC_LBA_RANGE:
- case NVME_SC_CAP_EXCEEDED:
- case NVME_SC_RESERVATION_CONFLICT:
- return false;
-
- /*
- * I/O command set specific error. Unfortunately these values are
- * reused for fabrics commands, but those should never get here.
- */
- case NVME_SC_BAD_ATTRIBUTES:
- case NVME_SC_INVALID_PI:
- case NVME_SC_READ_ONLY:
- case NVME_SC_ONCS_NOT_SUPPORTED:
- WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
- nvme_fabrics_command);
- return false;
-
- /*
- * Media and Data Integrity Errors:
- */
- case NVME_SC_WRITE_FAULT:
- case NVME_SC_READ_ERROR:
- case NVME_SC_GUARD_CHECK:
- case NVME_SC_APPTAG_CHECK:
- case NVME_SC_REFTAG_CHECK:
- case NVME_SC_COMPARE_FAILED:
- case NVME_SC_ACCESS_DENIED:
- case NVME_SC_UNWRITTEN_BLOCK:
- return false;
- }
-
- /* Everything else could be a path failure, so should be retried */
return true;
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..a251d1df4895 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -400,8 +400,7 @@ extern const struct attribute_group nvme_ns_id_attr_group;
extern const struct block_device_operations nvme_ns_head_ops;
#ifdef CONFIG_NVME_MULTIPATH
-void nvme_failover_req(struct request *req);
-bool nvme_req_needs_failover(struct request *req);
+bool nvme_failover_req(struct request *req, blk_status_t status);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_disk(struct nvme_ns_head *head);
@@ -418,10 +417,7 @@ static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
}
struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
#else
-static inline void nvme_failover_req(struct request *req)
-{
-}
-static inline bool nvme_req_needs_failover(struct request *req)
+static inline bool nvme_failover_req(struct request *req, blk_status_t status)
{
return false;
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1e628e032da..54e416acf55f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,21 @@ typedef u8 __bitwise blk_status_t;
#define BLK_STS_AGAIN ((__force blk_status_t)12)
+static inline bool blk_path_failure(blk_status_t status)
+{
+ switch (status) {
+ case BLK_STS_NOTSUPP:
+ case BLK_STS_NOSPC:
+ case BLK_STS_TARGET:
+ case BLK_STS_NEXUS:
+ case BLK_STS_MEDIUM:
+ case BLK_STS_PROTECTION:
+ return false;
+ }
+ /* Anything else could be a path failure, so should be retried */
+ return true;
+}
+
struct blk_issue_stat {
u64 stat;
};
--
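For readers outside the kernel tree, the decision the proposed helper encodes can be sketched in plain userspace C. The BLK_STS_* values below are illustrative stand-ins, not the authoritative definitions from include/linux/blk_types.h:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the proposed blk_path_failure() helper.
 * Statuses that describe the target or the data itself will fail
 * the same way on every path, so they are not path failures;
 * anything else may be transport related and is worth retrying
 * on another path.
 */
typedef unsigned char blk_status_t;

enum {
	BLK_STS_OK         = 0,
	BLK_STS_NOTSUPP    = 1,
	BLK_STS_NOSPC      = 3,
	BLK_STS_TARGET     = 5,
	BLK_STS_NEXUS      = 6,
	BLK_STS_MEDIUM     = 7,
	BLK_STS_PROTECTION = 8,
	BLK_STS_IOERR      = 10,	/* generic I/O error */
};

bool blk_path_failure(blk_status_t status)
{
	switch (status) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	}
	/* Anything else could be a path failure, so should be retried */
	return true;
}
```

In the proposal the helper is only consulted after the low-level driver has translated its native status, so BLK_STS_OK never reaches it; a multipath consumer (NVMe native or dm-mpath) retries on another path exactly when it returns true.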
* Re: [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2018-01-02 23:29 ` [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Keith Busch
@ 2018-01-03 0:24 ` Mike Snitzer
2018-01-04 10:26 ` Christoph Hellwig
2018-01-08 6:52 ` Hannes Reinecke
2 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2018-01-03 0:24 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
On Tue, Jan 02 2018 at 6:29pm -0500,
Keith Busch <keith.busch@intel.com> wrote:
> Instead of hiding NVMe path related errors, the NVMe driver needs to
> encode an appropriate generic block status from an NVMe status.
>
> We already do this translation whether or not CONFIG_NVME_MULTIPATH is
> set, so I think it's silly that NVMe native multipathing has a second status
> decoder. This just doubles the work if we need to handle any new NVMe
> status codes in the future.
>
> I have a counter-proposal below that unifies NVMe-to-block status
> translations, and combines common code for determining if an error is a
> path failure. This should work for both NVMe and DM, and DM won't need
> NVMe specifics.
I'm happy with this approach.
FYI, I did recently invert the logic on dm-mpath.c:noretry_error() and
staged for 4.16, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=806f913a543e30d0d608a823b11613bbeeba1f6d
But I can drop that patch and rebase accordingly.
Also, I think a better name for blk_path_failure() might be
blk_retryable_error(). I don't feel that strongly about it though.
> I can split this into a series if there's indication this is ok and
> satisfies the need.
I don't think it needs to be a series; having it be a single patch shows
why it makes sense to elevate the retryable error check to block core.
But maybe Jens will be able to give you more guidance on what he'd like
to see.
I'd very much like to see this land in 4.16.
Thanks!
Mike
* Re: [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2018-01-02 23:29 ` [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Keith Busch
2018-01-03 0:24 ` Mike Snitzer
@ 2018-01-04 10:26 ` Christoph Hellwig
2018-01-04 14:08 ` Mike Snitzer
2018-01-04 16:26 ` Mike Snitzer
2018-01-08 6:52 ` Hannes Reinecke
2 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2018-01-04 10:26 UTC (permalink / raw)
To: Keith Busch
Cc: Mike Snitzer, axboe, hch, emilne, james.smart, hare,
Bart.VanAssche, linux-block, linux-nvme, dm-devel
On Tue, Jan 02, 2018 at 04:29:43PM -0700, Keith Busch wrote:
> Instead of hiding NVMe path related errors, the NVMe driver needs to
> encode an appropriate generic block status from an NVMe status.
>
> We already do this translation whether or not CONFIG_NVME_MULTIPATH is
> set, so I think it's silly that NVMe native multipathing has a second status
> decoder. This just doubles the work if we need to handle any new NVMe
> status codes in the future.
>
> I have a counter-proposal below that unifies NVMe-to-block status
> translations, and combines common code for determining if an error is a
> path failure. This should work for both NVMe and DM, and DM won't need
> NVMe specifics.
>
> I can split this into a series if there's indication this is ok and
> satisfies the need.
You'll need to update nvme_error_status to account for all errors
handled in nvme_req_needs_failover, and you will probably have to
add additional BLK_STS_* code. But if this is all that the rage was
about I'm perfectly fine with it.
* Re: [for-4.16 PATCH v2 1/5] block: establish request failover callback
2017-12-29 20:19 ` Mike Snitzer
@ 2018-01-04 10:28 ` Christoph Hellwig
2018-01-04 14:42 ` Mike Snitzer
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2018-01-04 10:28 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, axboe, keith.busch, emilne, james.smart, hare,
Bart.VanAssche, linux-block, linux-nvme, dm-devel,
Martin K. Petersen, Hannes Reinecke
On Fri, Dec 29, 2017 at 03:19:04PM -0500, Mike Snitzer wrote:
> On Fri, Dec 29 2017 at 5:10am -0500,
> Christoph Hellwig <hch@lst.de> wrote:
>
> > On Tue, Dec 26, 2017 at 10:22:53PM -0500, Mike Snitzer wrote:
> > > All requests allocated from a request_queue with this callback set can
> > > failover their requests during completion.
> > >
> > > This callback is expected to use the blk_steal_bios() interface to
> > > transfer a request's bios back to an upper-layer bio-based
> > > request_queue.
> > >
> > > This will be used by both NVMe multipath and DM multipath. Without it
> > > DM multipath cannot get access to NVMe-specific error handling that NVMe
> > > core provides in nvme_complete_rq().
> >
> > And the whole point is that it should not get any such access.
>
> No the whole point is you hijacked multipathing for little to no gain.
That is your idea. In the end there have been a lot of complaints about
dm-multipath, and there was a lot of discussion about how to do things better,
with broad agreement on this approach. Up to the point where Hannes
has started considering doing something similar for SCSI.
And to be honest if this is the tone you'd like to set for technical
discussions I'm not really interested. Please calm down and stick
to a technical discussion.
* Re: [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2018-01-04 10:26 ` Christoph Hellwig
@ 2018-01-04 14:08 ` Mike Snitzer
2018-01-04 16:26 ` Mike Snitzer
1 sibling, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2018-01-04 14:08 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, axboe, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel
On Thu, Jan 04 2018 at 5:26am -0500,
Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Jan 02, 2018 at 04:29:43PM -0700, Keith Busch wrote:
> > Instead of hiding NVMe path related errors, the NVMe driver needs to
> > encode an appropriate generic block status from an NVMe status.
> >
> > We already do this translation whether or not CONFIG_NVME_MULTIPATH is
> > set, so I think it's silly that NVMe native multipathing has a second status
> > decoder. This just doubles the work if we need to handle any new NVMe
> > status codes in the future.
> >
> > I have a counter-proposal below that unifies NVMe-to-block status
> > translations, and combines common code for determining if an error is a
> > path failure. This should work for both NVMe and DM, and DM won't need
> > NVMe specifics.
> >
> > I can split this into a series if there's indication this is ok and
> > satisfies the need.
>
> You'll need to update nvme_error_status to account for all errors
> handled in nvme_req_needs_failover, and you will probably have to
> add additional BLK_STS_* code. But if this is all that the rage was
> about I'm perfectly fine with it.
Glad you're fine with it. I thought you'd balk at this too, mainly
because I was unaware nvme_error_status() existed, so I thought any
amount of new NVMe error translation for upper-layer consumption would
be met with resistance.
Keith arrived at this approach based on an exchange we had in private.
I gave him context for DM multipath's need to access the code NVMe uses
to determine if an NVMe-specific error is retryable or not. I explained
how SCSI uses scsi_dh error handling and
drivers/scsi/scsi_lib.c:__scsi_error_from_host_byte() to establish a
"differentiated IO error", and then
drivers/md/dm-mpath.c:noretry_error() consumes the resulting BLK_STS_*.
Armed with this context Keith was able to take his NVMe knowledge and
arrive at something you're fine with. Glad it worked out.
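The two-stage flow described above can be sketched roughly in userspace C. The names and values are illustrative stand-ins rather than the real kernel definitions, but they show the layering: a driver-specific status is translated once into a differentiated blk_status_t, and the multipath layer consults only the generic code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel types and constants. */
typedef unsigned char blk_status_t;

enum {
	BLK_STS_OK     = 0,
	BLK_STS_MEDIUM = 7,
	BLK_STS_IOERR  = 10,
};

enum dm_endio { DM_ENDIO_DONE, DM_ENDIO_REQUEUE };

/* Generic check: a medium error repeats on every path; anything else
 * could be path related and is worth retrying. */
bool blk_path_failure(blk_status_t status)
{
	return status != BLK_STS_MEDIUM;
}

/*
 * Sketch of the dm-mpath consumer: it never sees SCSI host bytes or
 * NVMe status codes, only the differentiated blk_status_t that the
 * low-level driver produced.
 */
enum dm_endio multipath_end_io(blk_status_t error)
{
	if (error != BLK_STS_OK && blk_path_failure(error))
		return DM_ENDIO_REQUEUE;	/* retry on another path */
	return DM_ENDIO_DONE;			/* success, or a hard error */
}
```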
Thanks,
Mike
* Re: [for-4.16 PATCH v2 1/5] block: establish request failover callback
2018-01-04 10:28 ` Christoph Hellwig
@ 2018-01-04 14:42 ` Mike Snitzer
0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2018-01-04 14:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, keith.busch, emilne, james.smart, hare, Bart.VanAssche,
linux-block, linux-nvme, dm-devel, Martin K. Petersen,
Hannes Reinecke
On Thu, Jan 04 2018 at 5:28am -0500,
Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Dec 29, 2017 at 03:19:04PM -0500, Mike Snitzer wrote:
> > On Fri, Dec 29 2017 at 5:10am -0500,
> > Christoph Hellwig <hch@lst.de> wrote:
> >
> > > On Tue, Dec 26, 2017 at 10:22:53PM -0500, Mike Snitzer wrote:
> > > > All requests allocated from a request_queue with this callback set can
> > > > failover their requests during completion.
> > > >
> > > > This callback is expected to use the blk_steal_bios() interface to
> > > > transfer a request's bios back to an upper-layer bio-based
> > > > request_queue.
> > > >
> > > > This will be used by both NVMe multipath and DM multipath. Without it
> > > > DM multipath cannot get access to NVMe-specific error handling that NVMe
> > > > core provides in nvme_complete_rq().
> > >
> > > And the whole point is that it should not get any such access.
> >
> > No the whole point is you hijacked multipathing for little to no gain.
>
> That is your idea. In the end there have been a lot of complaints about
> dm-multipath, and there was a lot of discussion about how to do things better,
> with broad agreement on this approach. Up to the point where Hannes
> has started considering doing something similar for SCSI.
Most of the "DM multipath" complaints I heard at LSF were fixable and pretty
superficial. Some less so, but Hannes had a vision for addressing
various SCSI stuff (which really complicated DM multipath).
But I'd really rather not dwell on all the history of NVMe native
multipathing's evolution. It isn't productive (other than to
acknowledge that there are far more efficient and productive ways to
coordinate such a change).
> And to be honest if this is the tone you'd like to set for technical
> discussions I'm not really interested. Please calm down and stick
> to a technical discussion.
I think you'd probably agree that you've repeatedly derailed or avoided
technical discussion if it got into "DM multipath". But again I'm not
looking to dwell on how dysfunctional this has been. I really do
appreciate your technical expertise. Sadly, I cannot say I feel you think
similarly of me.
I will say that I'm human; as such I have limits on what I'm willing to
accept. You leveraged your position to the point where it has started
to feel like you were lording it over me. That is tough to accept. It makes
my job _really_ feel like "work". All I've ever been trying to do
(since accepting the reality of "NVMe native multipathing") is bridge
the gap from the old solution to the new solution. I'm not opposed to the
new solution; it just needs to mature without being the _only_ way to
provide the feature (NVMe multipathing). Hopefully we can have productive
exchanges moving forward.
There are certainly some challenges associated with trying to allow a
kernel to support both NVMe native multipathing and DM multipathing.
E.g., would a multipath blacklist for the NVMe device scan be doable/acceptable?
I'd also like to understand if your vision for NVMe's ANA support will
model something like scsi_dh? Meaning ANA is a capability that, when
attached, augments the behavior of the NVMe device but that it is
otherwise internal to the device, and upper layers will get the benefit
of the ANA handler being attached. Also, I'm curious to know whether you
see that as needing to be tightly coupled to multipathing? If so, that is
the next interface point hurdle.
In the end I really think that DM multipath can help make NVMe native
multipath very robust (and vice-versa).
Mike
* Re: [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2018-01-04 10:26 ` Christoph Hellwig
2018-01-04 14:08 ` Mike Snitzer
@ 2018-01-04 16:26 ` Mike Snitzer
1 sibling, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2018-01-04 16:26 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: axboe, emilne, james.smart, hare, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
On Thu, Jan 04 2018 at 5:26am -0500,
Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Jan 02, 2018 at 04:29:43PM -0700, Keith Busch wrote:
> > Instead of hiding NVMe path related errors, the NVMe driver needs to
> > encode an appropriate generic block status from an NVMe status.
> >
> > We already do this translation whether or not CONFIG_NVME_MULTIPATH is
> > set, so I think it's silly that NVMe native multipathing has a second status
> > decoder. This just doubles the work if we need to handle any new NVMe
> > status codes in the future.
> >
> > I have a counter-proposal below that unifies NVMe-to-block status
> > translations, and combines common code for determining if an error is a
> > path failure. This should work for both NVMe and DM, and DM won't need
> > NVMe specifics.
> >
> > I can split this into a series if there's indication this is ok and
> > satisfies the need.
>
> You'll need to update nvme_error_status to account for all errors
> handled in nvme_req_needs_failover
These weren't accounted for (I may have mis-categorized them in this
patch, so please feel free to correct):
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 25ef349fd4e4..835a9c60200e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -165,10 +165,18 @@ static blk_status_t nvme_error_status(struct request *req)
case NVME_SC_ACCESS_DENIED:
case NVME_SC_READ_ONLY:
return BLK_STS_MEDIUM;
+ case NVME_SC_INVALID_OPCODE:
+ case NVME_SC_INVALID_FIELD:
+ case NVME_SC_INVALID_NS:
+ case NVME_SC_LBA_RANGE:
+ case NVME_SC_CAP_EXCEEDED:
+ case NVME_SC_BAD_ATTRIBUTES:
+ return BLK_STS_TARGET;
case NVME_SC_GUARD_CHECK:
case NVME_SC_APPTAG_CHECK:
case NVME_SC_REFTAG_CHECK:
case NVME_SC_INVALID_PI:
+ case NVME_SC_COMPARE_FAILED:
return BLK_STS_PROTECTION;
case NVME_SC_RESERVATION_CONFLICT:
return BLK_STS_NEXUS;
And of those, these aren't currently used:
NVME_SC_LBA_RANGE
NVME_SC_CAP_EXCEEDED
NVME_SC_BAD_ATTRIBUTES
NVME_SC_COMPARE_FAILED
At least not that I can see.
* Re: [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler
2018-01-02 23:29 ` [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Keith Busch
2018-01-03 0:24 ` Mike Snitzer
2018-01-04 10:26 ` Christoph Hellwig
@ 2018-01-08 6:52 ` Hannes Reinecke
2 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2018-01-08 6:52 UTC (permalink / raw)
To: Keith Busch, Mike Snitzer
Cc: axboe, hch, emilne, james.smart, Bart.VanAssche, linux-block,
linux-nvme, dm-devel
On 01/03/2018 12:29 AM, Keith Busch wrote:
> Instead of hiding NVMe path related errors, the NVMe driver needs to
> encode an appropriate generic block status from an NVMe status.
>
> We already do this translation whether or not CONFIG_NVME_MULTIPATH is
> set, so I think it's silly that NVMe native multipathing has a second status
> decoder. This just doubles the work if we need to handle any new NVMe
> status codes in the future.
>
> I have a counter-proposal below that unifies NVMe-to-block status
> translations, and combines common code for determining if an error is a
> path failure. This should work for both NVMe and DM, and DM won't need
> NVMe specifics.
>
> I can split this into a series if there's indication this is ok and
> satisfies the need.
>
I'm all for it. Go.
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
end of thread, other threads: [~2018-01-08 6:52 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-27 3:22 [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 1/5] block: establish request failover callback Mike Snitzer
2017-12-29 10:10 ` Christoph Hellwig
2017-12-29 20:19 ` Mike Snitzer
2018-01-04 10:28 ` Christoph Hellwig
2018-01-04 14:42 ` Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 2/5] nvme: use request_queue's failover_rq_fn callback for multipath failover Mike Snitzer
2017-12-29 10:11 ` Christoph Hellwig
2017-12-29 20:22 ` Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 3/5] nvme: move nvme_req_needs_failover() from multipath to core Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 4/5] dm mpath: use NVMe error handling to know when an error is retryable Mike Snitzer
2017-12-27 3:22 ` [for-4.16 PATCH v2 5/5] dm mpath: skip calls to end_io_bio if using NVMe bio-based and round-robin Mike Snitzer
2018-01-02 23:29 ` [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler Keith Busch
2018-01-03 0:24 ` Mike Snitzer
2018-01-04 10:26 ` Christoph Hellwig
2018-01-04 14:08 ` Mike Snitzer
2018-01-04 16:26 ` Mike Snitzer
2018-01-08 6:52 ` Hannes Reinecke