[PATCH 0/5] Improve blk-mq performance

Linux block layer
 help / color / mirror / Atom feed

* [PATCH 0/5] Improve blk-mq performance
@ 2018-07-27 16:20 Bart Van Assche
  2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hello Jens,

The blk-mq timeout handling rework that went upstream during the v4.18
development cycle introduced a performance regression due to the atomic
instructions that were added in the hot path. This patch series improves
blk-mq performance by reducing the number of atomic instructions in the
hot path. This patch series introduces the number of IOPS reported by the
following test with 15% (fifteen per cent) on an x86_64 system:

#!/bin/bash

if [ -e /sys/kernel/config/nullb ]; then
    for d in /sys/kernel/config/nullb/*; do
        [ -d "$d" ] && rmdir "$d"
    done
fi
modprobe -r null_blk
[ -e /sys/module/null_blk ] && exit $?
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 4096 > blocksize &&
    echo 0 > home_node &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 1 > power ||
    exit $?

(
    cd /sys/block/nullb0/queue &&
	echo 2 > rq_affinity
) || exit $?

iodepth=${1:-1}
runtime=30
args=()
if [ "$iodepth" = 1 ]; then
	args+=(--ioengine=psync)
else
	args+=(--ioengine=libaio --iodepth_batch=$((iodepth/2)))
fi
args+=(--iodepth=$iodepth --name=nullb0 --filename=/dev/nullb0\
    --rw=read --bs=4096 --loops=$((1<<20)) --direct=1 --numjobs=1 --thread\
    --runtime=$runtime --invalidate=1 --gtod_reduce=1 --ioscheduler=none)
numactl -m 0 -N 0 -- fio "${args[@]}"


Bart Van Assche (5):
  blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER
  block: Remove a superfluous #include directive
  block: Split blk_add_timer()
  block: Simplify blk_add_timer() and blk_mq_add_timer()
  blk-mq: Rework blk-mq timeout handling again

 Documentation/scsi/scsi_eh.txt            |   4 +-
 block/blk-core.c                          |   3 +
 block/blk-mq-debugfs.c                    |   3 +-
 block/blk-mq.c                            | 284 ++++++++++++++++------
 block/blk-mq.h                            |  10 +-
 block/blk-timeout.c                       | 109 +++++----
 block/blk.h                               |   1 +
 drivers/block/mtip32xx/mtip32xx.c         |   2 +-
 drivers/block/nbd.c                       |   4 +-
 drivers/block/null_blk_main.c             |   4 +-
 drivers/message/fusion/mptsas.c           |   2 +-
 drivers/mmc/core/queue.c                  |   2 +-
 drivers/nvme/host/pci.c                   |  10 +-
 drivers/nvme/host/rdma.c                  |   2 +-
 drivers/nvme/target/loop.c                |   2 +-
 drivers/s390/block/dasd.c                 |   6 +-
 drivers/scsi/gdth.c                       |   2 +-
 drivers/scsi/libiscsi.c                   |   6 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |   2 +-
 drivers/scsi/mvumi.c                      |   2 +-
 drivers/scsi/qla4xxx/ql4_os.c             |   2 +-
 drivers/scsi/scsi_error.c                 |  18 +-
 drivers/scsi/scsi_transport_fc.c          |   4 +-
 drivers/scsi/scsi_transport_srp.c         |   4 +-
 drivers/scsi/ufs/ufshcd.c                 |   6 +-
 include/linux/blk-mq.h                    |  14 --
 include/linux/blkdev.h                    |  52 +++-
 27 files changed, 367 insertions(+), 193 deletions(-)

-- 
2.18.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:20 ` [PATCH 2/5] block: Remove a superfluous #include directive Bart Van Assche
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Ming Lei,
	Hannes Reinecke

If a block driver timeout handler returns BLK_EH_DONE that means
either that the request has been completed or that the block driver
still owns the request. Since the name BLK_EH_DONE is misleading,
change it into BLK_EH_DONT_RESET_TIMER.

Fixes: 88b0cfad2888 ("block: document the blk_eh_timer_return values")
Fixes: 6600593cbd93 ("block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
---
 Documentation/scsi/scsi_eh.txt            |  4 ++--
 block/blk-mq.c                            |  2 +-
 block/blk-timeout.c                       |  2 +-
 drivers/block/mtip32xx/mtip32xx.c         |  2 +-
 drivers/block/nbd.c                       |  4 ++--
 drivers/block/null_blk_main.c             |  4 ++--
 drivers/message/fusion/mptsas.c           |  2 +-
 drivers/mmc/core/queue.c                  |  2 +-
 drivers/nvme/host/pci.c                   | 10 +++++-----
 drivers/nvme/host/rdma.c                  |  2 +-
 drivers/nvme/target/loop.c                |  2 +-
 drivers/s390/block/dasd.c                 |  6 +++---
 drivers/scsi/gdth.c                       |  2 +-
 drivers/scsi/libiscsi.c                   |  6 +++---
 drivers/scsi/megaraid/megaraid_sas_base.c |  2 +-
 drivers/scsi/mvumi.c                      |  2 +-
 drivers/scsi/qla4xxx/ql4_os.c             |  2 +-
 drivers/scsi/scsi_error.c                 |  4 ++--
 drivers/scsi/scsi_transport_fc.c          |  4 ++--
 drivers/scsi/scsi_transport_srp.c         |  4 ++--
 drivers/scsi/ufs/ufshcd.c                 |  6 +++---
 include/linux/blkdev.h                    |  2 +-
 22 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
index 1b7436932a2b..59e085b28b31 100644
--- a/Documentation/scsi/scsi_eh.txt
+++ b/Documentation/scsi/scsi_eh.txt
@@ -86,9 +86,9 @@ function
 	This indicates that more time is required to finish the
 	command.  Timer is restarted.  This action is counted as a
 	retry and only allowed scmd->allowed + 1(!) times.  Once the
-	limit is reached, action for BLK_EH_DONE is taken instead.
+	limit is reached, action for BLK_EH_DONT_RESET_TIMER is taken instead.
 
-    - BLK_EH_DONE
+    - BLK_EH_DONT_RESET_TIMER
         eh_timed_out() callback did not handle the command.
 	Step #2 is taken.
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c92ce06fd565..96e1a7f25875 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -774,7 +774,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		enum blk_eh_timer_return ret;
 
 		ret = req->q->mq_ops->timeout(req, reserved);
-		if (ret == BLK_EH_DONE)
+		if (ret == BLK_EH_DONT_RESET_TIMER)
 			return;
 		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
 	}
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index f2cfd56e1606..37df7f8f8516 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -90,7 +90,7 @@ static void blk_rq_timed_out(struct request *req)
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
-	case BLK_EH_DONE:
+	case BLK_EH_DONT_RESET_TIMER:
 		/*
 		 * LLD handles this for now but in the future
 		 * we can send a request msg to abort the command
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index c73626decb46..e28be7821d0c 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3712,7 +3712,7 @@ static enum blk_eh_timer_return mtip_cmd_timeout(struct request *req,
 
 		cmd->status = BLK_STS_TIMEOUT;
 		blk_mq_complete_request(req);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	if (test_bit(req->tag, dd->port->cmds_to_issue))
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 3fb95c8d9fd8..67e17965e462 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -382,7 +382,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 			mutex_unlock(&cmd->lock);
 			nbd_requeue_cmd(cmd);
 			nbd_config_put(nbd);
-			return BLK_EH_DONE;
+			return BLK_EH_DONT_RESET_TIMER;
 		}
 	} else {
 		dev_err_ratelimited(nbd_to_dev(nbd),
@@ -395,7 +395,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	nbd_config_put(nbd);
 done:
 	blk_mq_complete_request(req);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 /*
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 86cafa6d3b41..3617c9ac251a 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1320,7 +1320,7 @@ static enum blk_eh_timer_return null_rq_timed_out_fn(struct request *rq)
 {
 	pr_info("null: rq %p timed out\n", rq);
 	__blk_complete_request(rq);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static int null_rq_prep_fn(struct request_queue *q, struct request *req)
@@ -1383,7 +1383,7 @@ static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res)
 {
 	pr_info("null: rq %p timed out\n", rq);
 	blk_mq_complete_request(rq);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 76a66da33996..b703ad68f426 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -1929,7 +1929,7 @@ static enum blk_eh_timer_return mptsas_eh_timed_out(struct scsi_cmnd *sc)
 	MPT_SCSI_HOST *hd;
 	MPT_ADAPTER   *ioc;
 	VirtDevice    *vdevice;
-	enum blk_eh_timer_return rc = BLK_EH_DONE;
+	enum blk_eh_timer_return rc = BLK_EH_DONT_RESET_TIMER;
 
 	hd = shost_priv(sc->device->host);
 	if (hd == NULL) {
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 648eb6743ed5..ea32dcd6bfcf 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -113,7 +113,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
 		}
 		/* No timeout (XXX: huh? comment doesn't make much sense) */
 		blk_mq_complete_request(req);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	default:
 		/* Timeout is handled by mmc core */
 		return BLK_EH_RESET_TIMER;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8dcae11bbf3a..efb034b0b98e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1211,7 +1211,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		nvme_warn_reset(dev, csts);
 		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	/*
@@ -1221,14 +1221,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, completion polled\n",
 			 req->tag, nvmeq->qid);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	/*
 	 * Shutdown immediately if controller times out while starting. The
 	 * reset work will see the pci device disabled when it gets the forced
 	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_DONE.
+	 * shutdown, so we return BLK_EH_DONT_RESET_TIMER.
 	 */
 	switch (dev->ctrl.state) {
 	case NVME_CTRL_CONNECTING:
@@ -1238,7 +1238,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	default:
 		break;
 	}
@@ -1256,7 +1256,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		nvme_reset_ctrl(&dev->ctrl);
 
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0805fa6215ee..d18511427437 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1682,7 +1682,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
 	/* fail with DNR on cmd timeout */
 	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..419107cd952a 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -148,7 +148,7 @@ nvme_loop_timeout(struct request *rq, bool reserved)
 	/* fail with DNR on admin cmd timeout */
 	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index a9f60d0ee02e..d9bfebcc6aa7 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -3010,7 +3010,7 @@ static blk_status_t do_dasd_request(struct blk_mq_hw_ctx *hctx,
  *
  * Return values:
  * BLK_EH_RESET_TIMER if the request should be left running
- * BLK_EH_DONE if the request is handled or terminated
+ * BLK_EH_DONT_RESET_TIMER if the request is handled or terminated
  *		      by the driver.
  */
 enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
@@ -3023,7 +3023,7 @@ enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
 
 	cqr = blk_mq_rq_to_pdu(req);
 	if (!cqr)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 
 	spin_lock_irqsave(&cqr->dq->lock, flags);
 	device = cqr->startdev ? cqr->startdev : block->base;
@@ -3078,7 +3078,7 @@ enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
 	spin_unlock(&block->queue_lock);
 	spin_unlock_irqrestore(&cqr->dq->lock, flags);
 
-	return rc ? BLK_EH_RESET_TIMER : BLK_EH_DONE;
+	return rc ? BLK_EH_RESET_TIMER : BLK_EH_DONT_RESET_TIMER;
 }
 
 static int dasd_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 85604795d8ee..fa18d3d1dc9c 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -3882,7 +3882,7 @@ static enum blk_eh_timer_return gdth_timed_out(struct scsi_cmnd *scp)
 	struct gdth_cmndinfo *cmndinfo = gdth_cmnd_priv(scp);
 	u8 b, t;
 	unsigned long flags;
-	enum blk_eh_timer_return retval = BLK_EH_DONE;
+	enum blk_eh_timer_return retval = BLK_EH_DONT_RESET_TIMER;
 
 	TRACE(("%s() cmd 0x%x\n", scp->cmnd[0], __func__));
 	b = scp->device->channel;
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index d6093838f5f2..6ede7e01df30 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1963,7 +1963,7 @@ static int iscsi_has_ping_timed_out(struct iscsi_conn *conn)
 
 enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 {
-	enum blk_eh_timer_return rc = BLK_EH_DONE;
+	enum blk_eh_timer_return rc = BLK_EH_DONT_RESET_TIMER;
 	struct iscsi_task *task = NULL, *running_task;
 	struct iscsi_cls_session *cls_session;
 	struct iscsi_session *session;
@@ -1982,7 +1982,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 		 * Raced with completion. Blk layer has taken ownership
 		 * so let timeout code complete it now.
 		 */
-		rc = BLK_EH_DONE;
+		rc = BLK_EH_DONT_RESET_TIMER;
 		goto done;
 	}
 
@@ -1997,7 +1997,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 		if (unlikely(system_state != SYSTEM_RUNNING)) {
 			sc->result = DID_NO_CONNECT << 16;
 			ISCSI_DBG_EH(session, "sc on shutdown, handled\n");
-			rc = BLK_EH_DONE;
+			rc = BLK_EH_DONT_RESET_TIMER;
 			goto done;
 		}
 		/*
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 71d97573a667..bd5bc1da9be1 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -2768,7 +2768,7 @@ blk_eh_timer_return megasas_reset_timer(struct scsi_cmnd *scmd)
 
 	if (time_after(jiffies, scmd->jiffies_at_alloc +
 				(scmd_timeout * 2) * HZ)) {
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	instance = (struct megasas_instance *)scmd->device->host->hostdata;
diff --git a/drivers/scsi/mvumi.c b/drivers/scsi/mvumi.c
index b3cd9a6b1d30..84cdadd4bd78 100644
--- a/drivers/scsi/mvumi.c
+++ b/drivers/scsi/mvumi.c
@@ -2155,7 +2155,7 @@ static enum blk_eh_timer_return mvumi_timed_out(struct scsi_cmnd *scmd)
 	mvumi_return_cmd(mhba, cmd);
 	spin_unlock_irqrestore(mhba->shost->host_lock, flags);
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static int
diff --git a/drivers/scsi/qla4xxx/ql4_os.c b/drivers/scsi/qla4xxx/ql4_os.c
index 0e13349dce57..ecabca8d815b 100644
--- a/drivers/scsi/qla4xxx/ql4_os.c
+++ b/drivers/scsi/qla4xxx/ql4_os.c
@@ -1848,7 +1848,7 @@ static enum blk_eh_timer_return qla4xxx_eh_cmd_timed_out(struct scsi_cmnd *sc)
 	struct iscsi_cls_session *session;
 	struct iscsi_session *sess;
 	unsigned long flags;
-	enum blk_eh_timer_return ret = BLK_EH_DONE;
+	enum blk_eh_timer_return ret = BLK_EH_DONT_RESET_TIMER;
 
 	session = starget_to_session(scsi_target(sc->device));
 	sess = session->dd_data;
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2715cdaa669c..b38b8e62e618 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -283,7 +283,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 enum blk_eh_timer_return scsi_times_out(struct request *req)
 {
 	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
-	enum blk_eh_timer_return rtn = BLK_EH_DONE;
+	enum blk_eh_timer_return rtn = BLK_EH_DONT_RESET_TIMER;
 	struct Scsi_Host *host = scmd->device->host;
 
 	trace_scsi_dispatch_cmd_timeout(scmd);
@@ -295,7 +295,7 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	if (host->hostt->eh_timed_out)
 		rtn = host->hostt->eh_timed_out(scmd);
 
-	if (rtn == BLK_EH_DONE) {
+	if (rtn == BLK_EH_DONT_RESET_TIMER) {
 		/*
 		 * For blk-mq, we must set the request state to complete now
 		 * before sending the request to the scsi error handler. This
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 13948102ca29..63d283f298f3 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2087,7 +2087,7 @@ fc_eh_timed_out(struct scsi_cmnd *scmd)
 	if (rport->port_state == FC_PORTSTATE_BLOCKED)
 		return BLK_EH_RESET_TIMER;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 EXPORT_SYMBOL(fc_eh_timed_out);
 
@@ -3593,7 +3593,7 @@ fc_bsg_job_timeout(struct request *req)
 	/* the blk_end_sync_io() doesn't check the error */
 	if (inflight)
 		__blk_complete_request(req);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 /**
diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
index 4e46fdb2d7c9..03571512dca5 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -604,7 +604,7 @@ EXPORT_SYMBOL(srp_reconnect_rport);
  *
  * If a timeout occurs while an rport is in the blocked state, ask the SCSI
  * EH to continue waiting (BLK_EH_RESET_TIMER). Otherwise let the SCSI core
- * handle the timeout (BLK_EH_DONE).
+ * handle the timeout (BLK_EH_DONT_RESET_TIMER).
  *
  * Note: This function is called from soft-IRQ context and with the request
  * queue lock held.
@@ -620,7 +620,7 @@ enum blk_eh_timer_return srp_timed_out(struct scsi_cmnd *scmd)
 	return rport && rport->fast_io_fail_tmo < 0 &&
 		rport->dev_loss_tmo < 0 &&
 		i->f->reset_timer_if_blocked && scsi_device_blocked(sdev) ?
-		BLK_EH_RESET_TIMER : BLK_EH_DONE;
+		BLK_EH_RESET_TIMER : BLK_EH_DONT_RESET_TIMER;
 }
 EXPORT_SYMBOL(srp_timed_out);
 
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 397081d320b1..2184222953b2 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6682,12 +6682,12 @@ static enum blk_eh_timer_return ufshcd_eh_timed_out(struct scsi_cmnd *scmd)
 	bool found = false;
 
 	if (!scmd || !scmd->device || !scmd->device->host)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 
 	host = scmd->device->host;
 	hba = shost_priv(host);
 	if (!hba)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 
 	spin_lock_irqsave(host->host_lock, flags);
 
@@ -6705,7 +6705,7 @@ static enum blk_eh_timer_return ufshcd_eh_timed_out(struct scsi_cmnd *scmd)
 	 * SCSI command was not actually dispatched to UFS driver, otherwise
 	 * let SCSI layer handle the error as usual.
 	 */
-	return found ? BLK_EH_DONE : BLK_EH_RESET_TIMER;
+	return found ? BLK_EH_DONT_RESET_TIMER : BLK_EH_RESET_TIMER;
 }
 
 static const struct attribute_group *ufshcd_driver_groups[] = {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..d5d31b7b6ba8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -328,7 +328,7 @@ typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
 typedef void (exit_rq_fn)(struct request_queue *, struct request *);
 
 enum blk_eh_timer_return {
-	BLK_EH_DONE,		/* drivers has completed the command */
+	BLK_EH_DONT_RESET_TIMER,/* driver completed or owns the command */
 	BLK_EH_RESET_TIMER,	/* reset timer and try again */
 };
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/5] block: Remove a superfluous #include directive
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Commit 12f5b9314545 ("blk-mq: Remove generation seqeunce") removed the
only u64_stats_sync instance from <linux/blkdev.h> but did not remove
the corresponding #include directive. Since the #include directive is
no longer needed, remove it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 include/linux/blkdev.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d5d31b7b6ba8..8244a5a1aa5b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -28,7 +28,6 @@
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
 #include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/5] block: Split blk_add_timer()
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
  2018-07-27 16:20 ` [PATCH 2/5] block: Remove a superfluous #include directive Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:29   ` Keith Busch
  2018-07-27 16:20 ` [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Split blk_add_timer() as follows:
- Introduce the helper function __blk_add_timer() that performs the
  tasks that apply both to the legacy block layer core and to blk-mq.
- Duplicate the remaining blk_add_timer() function into the new
  function blk_mq_add_timer().
- Change the blk_mq_timer() calls into blk_mq_add_timer() calls in
  blk-mq code.

This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c      |  4 +--
 block/blk-timeout.c | 81 +++++++++++++++++++++++++++++++++++----------
 block/blk.h         |  1 +
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 96e1a7f25875..1b49973629f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -644,7 +644,7 @@ void blk_mq_start_request(struct request *rq)
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
-	blk_add_timer(rq);
+	blk_mq_add_timer(rq);
 	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
@@ -779,7 +779,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
 	}
 
-	blk_add_timer(req);
+	blk_mq_add_timer(req);
 }
 
 static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 37df7f8f8516..527670f2f985 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -181,6 +181,33 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
+static void __blk_add_timer(struct request *req, unsigned long deadline)
+{
+	struct request_queue *q = req->q;
+	unsigned long expiry;
+
+	/*
+	 * If the timer isn't already pending or this timeout is earlier
+	 * than an existing one, modify the timer. Round up to next nearest
+	 * second.
+	 */
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
+	if (!timer_pending(&q->timeout) ||
+	    time_before(expiry, q->timeout.expires)) {
+		unsigned long diff = q->timeout.expires - expiry;
+
+		/*
+		 * Due to added timer slack to group timers, the timer
+		 * will often be a little in front of what we asked for.
+		 * So apply some tolerance here too, otherwise we keep
+		 * modifying the timer because expires for value X
+		 * will be X + something.
+		 */
+		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
+			mod_timer(&q->timeout, expiry);
+	}
+}
+
 /**
  * blk_add_timer - Start timeout timer for a single request
  * @req:	request that is about to start running.
@@ -192,7 +219,6 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 void blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
-	unsigned long expiry;
 
 	if (!q->mq_ops)
 		lockdep_assert_held(q->queue_lock);
@@ -220,26 +246,45 @@ void blk_add_timer(struct request *req)
 	if (!q->mq_ops)
 		list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
+	__blk_add_timer(req, blk_rq_deadline(req));
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ *
+ * Sets the deadline of a request. The caller must guarantee that the request
+ * state won't be modified while this function is in progress.
+ */
+void blk_mq_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+
+	if (!q->mq_ops)
+		lockdep_assert_held(q->queue_lock);
+
+	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
+	if (!q->mq_ops && !q->rq_timed_out_fn)
+		return;
+
+	BUG_ON(!list_empty(&req->timeout_list));
+
 	/*
-	 * If the timer isn't already pending or this timeout is earlier
-	 * than an existing one, modify the timer. Round up to next nearest
-	 * second.
+	 * Some LLDs, like scsi, peek at the timeout to prevent a
+	 * command from being retried forever.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
 
-	if (!timer_pending(&q->timeout) ||
-	    time_before(expiry, q->timeout.expires)) {
-		unsigned long diff = q->timeout.expires - expiry;
+	req->rq_flags &= ~RQF_TIMED_OUT;
+	blk_rq_set_deadline(req, jiffies + req->timeout);
 
-		/*
-		 * Due to added timer slack to group timers, the timer
-		 * will often be a little in front of what we asked for.
-		 * So apply some tolerance here too, otherwise we keep
-		 * modifying the timer because expires for value X
-		 * will be X + something.
-		 */
-		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
-			mod_timer(&q->timeout, expiry);
-	}
+	/*
+	 * Only the non-mq case needs to add the request to a protected list.
+	 * For the mq case we simply scan the tag map.
+	 */
+	if (!q->mq_ops)
+		list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
+	__blk_add_timer(req, blk_rq_deadline(req));
 }
diff --git a/block/blk.h b/block/blk.h
index 69b14cd2bb22..6adae8f94279 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,7 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req);
 void blk_delete_timer(struct request *);
 
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer()
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
                   ` (2 preceding siblings ...)
  2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  2018-07-27 16:22 ` [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Use the fact that q->mq_ops == NULL for the legacy block layer and
also that q->mq_ops != NULL for blk-mq to simplify blk_add_timer and
blk_mq_add_timer(). This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-timeout.c | 54 +++++++++------------------------------------
 1 file changed, 11 insertions(+), 43 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 527670f2f985..02d7e3427369 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -219,34 +219,21 @@ static void __blk_add_timer(struct request *req, unsigned long deadline)
 void blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
+	unsigned long deadline;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
+	lockdep_assert_held(q->queue_lock);
 
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
+	if (!q->rq_timed_out_fn)
 		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
 	req->rq_flags &= ~RQF_TIMED_OUT;
-	blk_rq_set_deadline(req, jiffies + req->timeout);
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
-	__blk_add_timer(req, blk_rq_deadline(req));
+	__blk_add_timer(req, deadline);
 }
 
 /**
@@ -259,32 +246,13 @@ void blk_add_timer(struct request *req)
 void blk_mq_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
+	unsigned long deadline;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
 	req->rq_flags &= ~RQF_TIMED_OUT;
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
-	__blk_add_timer(req, blk_rq_deadline(req));
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	__blk_add_timer(req, deadline);
 }
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
                   ` (3 preceding siblings ...)
  2018-07-27 16:20 ` [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:46   ` Keith Busch
  2018-07-27 16:22 ` [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  5 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Keith Busch, Jianchao Wang, Ming Lei

Recently the blk-mq timeout handling code was reworked to avoid that
completions that occur while a timeout handler is in progress get
ignored. However, that rework removed the protection against completions
that occur while a timeout handler is in progress. Fix this by
introducing a new request state, namely MQ_RQ_TIMED_OUT. This state
means that the timeout handler is in progress.

Other changes in this patch are:
- Reintroduce the request generation number to avoid that request
  timeout handling happens after a request has already completed.
- Reintroduce deadline_seq to make reads and updates of the request
  deadline possible without having to use atomic instructions.
- Remove the request state information that became superfluous due to
  this patch, namely the RQF_TIMED_OUT flag and the ref member.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- Move the code for managing the request state back from the SCSI
  core into the block layer core.

This patch reverts the following two commits:
* 0fc09f920983 ("blk-mq: export setting request completion state")
* 065990bd198e ("scsi: set timed out out mq requests to complete")

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          |   3 +
 block/blk-mq-debugfs.c    |   3 +-
 block/blk-mq.c            | 280 ++++++++++++++++++++++++++++----------
 block/blk-mq.h            |  10 +-
 block/blk-timeout.c       |  28 ++--
 drivers/scsi/scsi_error.c |  14 --
 include/linux/blk-mq.h    |  14 --
 include/linux/blkdev.h    |  49 +++++--
 8 files changed, 281 insertions(+), 120 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03a4ea93a5f3..3c63e410ede4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,6 +198,9 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..72abc9a87c53 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -335,8 +335,9 @@ static const char *const rqf_name[] = {
 
 static const char *const blk_mq_rq_state_name_array[] = {
 	[MQ_RQ_IDLE]		= "idle",
-	[MQ_RQ_IN_FLIGHT]	= "in_flight",
+	[MQ_RQ_IN_FLIGHT]	= "in flight",
 	[MQ_RQ_COMPLETE]	= "complete",
+	[MQ_RQ_TIMED_OUT]	= "timed out",
 };
 
 static const char *blk_mq_rq_state_name(enum mq_rq_state rq_state)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1b49973629f6..a97ab5ba9d18 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -6,6 +6,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/atomic.h>	/* cmpxchg() */
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -322,6 +323,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
@@ -332,7 +334,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 #endif
 
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
-	refcount_set(&rq->ref, 1);
 	return rq;
 }
 
@@ -466,19 +467,42 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-static void __blk_mq_free_request(struct request *rq)
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static bool blk_mq_change_rq_state(struct request *rq,
+				   enum mq_rq_state old_state,
+				   enum mq_rq_state new_state)
 {
-	struct request_queue *q = rq->q;
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	const int sched_tag = rq->internal_tag;
+	union blk_generation_and_state old_val = READ_ONCE(rq->gstate);
+	union blk_generation_and_state new_val = old_val;
 
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-	blk_mq_sched_restart(hctx);
-	blk_queue_exit(q);
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight or timed out to another state
+	 * cmpxchg() must be used. For other state transitions it is safe to
+	 * use WRITE_ONCE().
+	 */
+	switch (old_state) {
+	case MQ_RQ_IN_FLIGHT:
+	case MQ_RQ_TIMED_OUT:
+		WRITE_ONCE(rq->gstate.val, new_val.val);
+		return true;
+	case MQ_RQ_IDLE:
+	case MQ_RQ_COMPLETE:
+		return blk_mq_set_rq_state(rq, old_val, new_val);
+	}
+	WARN(true, "Invalid request state %d\n", old_state);
+	return false;
 }
 
 void blk_mq_free_request(struct request *rq)
@@ -487,6 +511,7 @@ void blk_mq_free_request(struct request *rq)
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	const int sched_tag = rq->internal_tag;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -509,9 +534,14 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
+	if (rq->tag != -1)
+		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	if (sched_tag != -1)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+	blk_mq_sched_restart(hctx);
+	blk_queue_exit(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
@@ -558,8 +588,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (!blk_mq_mark_complete(rq))
-		return;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
@@ -615,7 +645,18 @@ void blk_mq_complete_request(struct request *rq)
 {
 	if (unlikely(blk_should_fake_timeout(rq->q)))
 		return;
-	__blk_mq_complete_request(rq);
+again:
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+		__blk_mq_complete_request(rq);
+		return;
+	}
+	if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+		return;
+	if (WARN_ON_ONCE(blk_mq_rq_state(rq) == MQ_RQ_IDLE ||
+			 blk_mq_rq_state(rq) == MQ_RQ_COMPLETE))
+		return;
+	/* In the unlikely case of a race with the timeout code, retry. */
+	goto again;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -645,7 +686,7 @@ void blk_mq_start_request(struct request *rq)
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	blk_mq_add_timer(rq);
-	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
+	WARN_ON_ONCE(!blk_mq_change_rq_state(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT));
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -658,23 +699,46 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
+/**
+ * __blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ *
+ * This function is either called by blk_mq_requeue_request() or by the block
+ * layer core if .queue_rq() returned BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
+ * If the request state is MQ_RQ_IN_FLIGHT and if this function is called from
+ * inside .queue_rq() then it is guaranteed that the timeout code won't try to
+ * modify the request state while this function is in progress because an RCU
+ * read lock is held around .queue_rq() and because the timeout code calls
+ * synchronize_rcu() after having marked requests and before processing
+ * time-outs.
+ */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	rq_qos_requeue(q, rq);
 
-	if (blk_mq_request_started(rq)) {
-		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
-		rq->rq_flags &= ~RQF_TIMED_OUT;
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
 }
 
+/**
+ * blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ * @kick_requeue_list: whether or not to kick the requeue_list
+ *
+ * This function is called after a request has completed, after a request has
+ * timed out or from inside .queue_rq(). In the latter case the request may
+ * already have been started.
+ */
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
 	__blk_mq_requeue_request(rq);
@@ -767,82 +831,117 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
+struct blk_mq_timeout_data {
+	unsigned long next;
+	unsigned int next_set;
+	unsigned int nr_expired;
+};
+
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
-	req->rq_flags |= RQF_TIMED_OUT;
-	if (req->q->mq_ops->timeout) {
-		enum blk_eh_timer_return ret;
+	enum blk_eh_timer_return ret;
 
-		ret = req->q->mq_ops->timeout(req, reserved);
-		if (ret == BLK_EH_DONT_RESET_TIMER)
-			return;
-		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
+	if (!req->q->mq_ops->timeout) {
+		blk_mq_add_timer(req);
+		return;
 	}
 
+	ret = req->q->mq_ops->timeout(req, reserved);
+	/*
+	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
+	 * completed the request or still owns the request and will
+	 * continue processing the timeout asynchronously. In the
+	 * latter case, if blk_mq_complete_request() was called while
+	 * the timeout handler was in progress, ignore that call.
+	 */
+	if (ret == BLK_EH_DONT_RESET_TIMER)
+		return;
+	WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
 	blk_mq_add_timer(req);
+again:
+	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
+		return;
+	if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+		__blk_mq_complete_request(req);
+		return;
+	}
+	if (WARN_ON_ONCE(blk_mq_rq_state(req) == MQ_RQ_IDLE ||
+			 blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT))
+		return;
+	/*
+	 * In the unlikely case of a race with the request completion code,
+	 * retry.
+	 */
+	goto again;
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, struct request *rq,
+				 void *priv, bool reserved)
 {
+	struct blk_mq_timeout_data *data = priv;
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
 	unsigned long deadline;
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
-		return false;
-	if (rq->rq_flags & RQF_TIMED_OUT)
-		return false;
+	might_sleep();
 
-	deadline = blk_rq_deadline(rq);
-	if (time_after_eq(jiffies, deadline))
-		return true;
+#ifdef CONFIG_64BIT
+	deadline = rq->__deadline;
+#else
+	while (true) {
+		int start;
 
-	if (*next == 0)
-		*next = deadline;
-	else if (time_after(*next, deadline))
-		*next = deadline;
-	return false;
+		start = read_seqcount_begin(&rq->deadline_seq);
+		deadline = rq->__deadline;
+		if (!read_seqcount_retry(&rq->deadline_seq, start))
+			break;
+		cond_resched();
+	}
+#endif
+
+	/*
+	 * Make sure that rq->aborted_gstate != rq->gstate if a request gets
+	 * recycled before blk_mq_terminate_expired() is called.
+	 */
+	rq->aborted_gstate = gstate;
+	rq->aborted_gstate.generation ^= (1UL << 29);
+	if (gstate.state == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		/* request timed out */
+		rq->aborted_gstate = gstate;
+		data->nr_expired++;
+		hctx->nr_expired++;
+	} else if (!data->next_set || time_after(data->next, deadline)) {
+		data->next = deadline;
+		data->next_set = 1;
+	}
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
-	unsigned long *next = priv;
+	union blk_generation_and_state old_val = rq->aborted_gstate;
+	union blk_generation_and_state new_val = old_val;
 
-	/*
-	 * Just do a quick check if it is expired before locking the request in
-	 * so we're not unnecessarilly synchronizing across CPUs.
-	 */
-	if (!blk_mq_req_expired(rq, next))
-		return;
-
-	/*
-	 * We have reason to believe the request may be expired. Take a
-	 * reference on the request to lock this request lifetime into its
-	 * currently allocated context to prevent it from being reallocated in
-	 * the event the completion by-passes this timeout handler.
-	 *
-	 * If the reference was already released, then the driver beat the
-	 * timeout handler to posting a natural completion.
-	 */
-	if (!refcount_inc_not_zero(&rq->ref))
-		return;
+	new_val.state = MQ_RQ_TIMED_OUT;
 
 	/*
-	 * The request is now locked and cannot be reallocated underneath the
-	 * timeout handler's processing. Re-verify this exact request is truly
-	 * expired; if it is not expired, then the request was completed and
-	 * reallocated as a new request.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->gstate has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (blk_mq_req_expired(rq, next))
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	unsigned long next = 0;
+	struct blk_mq_timeout_data data = {
+		.next		= 0,
+		.next_set	= 0,
+		.nr_expired	= 0,
+	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -862,10 +961,41 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+	/* scan for the expired ones and set their ->aborted_gstate */
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
+
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished. This is also
+		 * necessary to avoid races with blk_mq_requeue_request() for
+		 * requests that have already been started.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->srcu);
 
-	if (next != 0) {
-		mod_timer(&q->timeout, next);
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
+	if (data.next_set) {
+		data.next = blk_rq_timeout(round_jiffies_up(data.next));
+		mod_timer(&q->timeout, data.next);
 	} else {
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
@@ -2009,7 +2139,9 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47e2526..de92cddc147f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -90,13 +90,21 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_generation_and_state old_val,
+				       union blk_generation_and_state new_val)
+{
+	return cmpxchg(&rq->gstate.val, old_val.val, new_val.val) ==
+		old_val.val;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
 static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->state);
+	return READ_ONCE(rq->gstate).state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 02d7e3427369..b37b8090f3ae 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -142,6 +142,22 @@ void blk_timeout_work(struct work_struct *work)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
+/* Update deadline to @time. May be called from interrupt context. */
+static void blk_mq_rq_set_deadline(struct request *rq, unsigned long new_time)
+{
+#ifdef CONFIG_64BIT
+	rq->__deadline = new_time;
+#else
+	unsigned long flags;
+
+	local_irq_save(flags);
+	write_seqcount_begin(&rq->deadline_seq);
+	rq->__deadline = new_time;
+	write_seqcount_end(&rq->deadline_seq);
+	local_irq_restore(flags);
+#endif
+}
+
 /**
  * blk_abort_request -- Request request recovery for the specified command
  * @req:	pointer to the request of interest
@@ -155,11 +171,10 @@ void blk_abort_request(struct request *req)
 {
 	if (req->q->mq_ops) {
 		/*
-		 * All we need to ensure is that timeout scan takes place
-		 * immediately and that scan sees the new timeout value.
-		 * No need for fancy synchronizations.
+		 * Ensure that a timeout scan takes place immediately and that
+		 * that scan sees the new timeout value.
 		 */
-		blk_rq_set_deadline(req, jiffies);
+		blk_mq_rq_set_deadline(req, jiffies);
 		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
@@ -228,7 +243,6 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	req->rq_flags &= ~RQF_TIMED_OUT;
 	deadline = jiffies + req->timeout;
 	blk_rq_set_deadline(req, deadline);
 	list_add_tail(&req->timeout_list, &req->q->timeout_list);
@@ -250,9 +264,7 @@ void blk_mq_add_timer(struct request *req)
 
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
-
-	req->rq_flags &= ~RQF_TIMED_OUT;
 	deadline = jiffies + req->timeout;
-	blk_rq_set_deadline(req, deadline);
+	blk_mq_rq_set_deadline(req, deadline);
 	__blk_add_timer(req, deadline);
 }
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index b38b8e62e618..fa18f9d78170 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -296,20 +296,6 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 		rtn = host->hostt->eh_timed_out(scmd);
 
 	if (rtn == BLK_EH_DONT_RESET_TIMER) {
-		/*
-		 * For blk-mq, we must set the request state to complete now
-		 * before sending the request to the scsi error handler. This
-		 * will prevent a use-after-free in the event the LLD manages
-		 * to complete the request before the error handler finishes
-		 * processing this timed out request.
-		 *
-		 * If the request was already completed, then the LLD beat the
-		 * time out handler from transferring the request to the scsi
-		 * error handler. In that case we can return immediately as no
-		 * further action is required.
-		 */
-		if (req->q->mq_ops && !blk_mq_mark_complete(req))
-			return rtn;
 		if (scsi_abort_command(scmd) != SUCCESS) {
 			set_host_byte(scmd, DID_TIME_OUT);
 			scsi_eh_scmd_add(scmd);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1da59c16f637..d710e92874cc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,20 +289,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-	return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-			MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8244a5a1aa5b..58e5b22123a2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -126,20 +126,39 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
-/* ->timeout has been called, don't expire again */
-#define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
-/*
- * Request state for blk-mq.
+union blk_generation_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+	};
+	uint32_t val;
+};
+
+/**
+ * enum mq_rq_state - blk-mq request state
+ *
+ * The legal state transitions are:
+ * - idle      -> in flight: blk_mq_start_request()
+ * - in flight -> complete:  blk_mq_complete_request()
+ * - timed out -> complete:  blk_mq_complete_request()
+ * - in flight -> timed out: request times out
+ * - complete  -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - in flight -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> in flight: request restart due to BLK_EH_RESET_TIMER
+ *
+ * See also blk_generation_and_state.state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_TIMED_OUT		= 3,
 };
 
 /*
@@ -242,12 +261,26 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	enum mq_rq_state state;
-	refcount_t ref;
-
 	unsigned int timeout;
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	union blk_generation_and_state aborted_gstate;
+
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues. deadline_seq protects __deadline in blk-mq mode
+	 * only.
+	 */
+	union blk_generation_and_state gstate;
+#ifndef CONFIG_64BIT
+	seqcount_t deadline_seq;
+#endif
 	unsigned long __deadline;
 
 	struct list_head timeout_list;
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve blk-mq performance
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
                   ` (4 preceding siblings ...)
  2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-07-27 16:22 ` Bart Van Assche
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:22 UTC (permalink / raw)
  To: axboe@kernel.dk; +Cc: hch@lst.de, linux-block@vger.kernel.org

On Fri, 2018-07-27 at 09:20 -0700, Bart Van Assche wrote:
> hot path. This patch series introduces the number of IOPS reported by=
 the
                              ^^^^^^^^^^
This is a typo - I meant "improves".

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: Split blk_add_timer()
  2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
@ 2018-07-27 16:29   ` Keith Busch
  2018-07-27 16:31     ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 16:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei

On Fri, Jul 27, 2018 at 09:20:40AM -0700, Bart Van Assche wrote:
> +void blk_mq_add_timer(struct request *req)
> +{
> +	struct request_queue *q = req->q;
> +
> +	if (!q->mq_ops)
> +		lockdep_assert_held(q->queue_lock);
> +
> +	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
> +	if (!q->mq_ops && !q->rq_timed_out_fn)
> +		return;

<snip>

> +	if (!q->mq_ops)
> +		list_add_tail(&req->timeout_list, &req->q->timeout_list);

Do you really need these checks for !q->mq_ops?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: Split blk_add_timer()
  2018-07-27 16:29   ` Keith Busch
@ 2018-07-27 16:31     ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:31 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, ming.lei@redhat.com, linux-block@vger.kernel.org,
	axboe@kernel.dk, jianchao.w.wang@oracle.com

On Fri, 2018-07-27 at 10:29 -0600, Keith Busch wrote:
> On Fri, Jul 27, 2018 at 09:20:40AM -0700, Bart Van Assche wrote:
> > +	if (!q->mq_ops)
> > +		list_add_tail(&req->timeout_list, &=
req->q->timeout_list);
>=20
> Do you really need these checks for !q->mq_ops?

Hello Keith,

Have you noticed that patch 4/5 removes that check?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-07-27 16:46   ` Keith Busch
  2018-07-27 16:51     ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 16:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Jianchao Wang, Ming Lei

On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> +	ret = req->q->mq_ops->timeout(req, reserved);
> +	/*
> +	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
> +	 * completed the request or still owns the request and will
> +	 * continue processing the timeout asynchronously. In the
> +	 * latter case, if blk_mq_complete_request() was called while
> +	 * the timeout handler was in progress, ignore that call.
> +	 */
> +	if (ret == BLK_EH_DONT_RESET_TIMER)
> +		return;

This is how completions get lost.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:46   ` Keith Busch
@ 2018-07-27 16:51     ` Bart Van Assche
  2018-07-27 16:57       ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:51 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, ming.lei@redhat.com, linux-block@vger.kernel.org,
	martin.petersen@oracle.com, axboe@kernel.dk,
	jianchao.w.wang@oracle.com

On Fri, 2018-07-27 at 10:46 -0600, Keith Busch wrote:
> On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> > +	ret = req->q->mq_ops->timeout(req, reser=
ved);
> > +	/*
> > +	 * BLK_EH_DONT_RESET_TIMER means that th=
e block driver either
> > +	 * completed the request or still owns the request and w=
ill
> > +	 * continue processing the timeout asynchronously. In th=
e
> > +	 * latter case, if blk_mq_complete_request()=
 was called while
> > +	 * the timeout handler was in progress, ignore that call=
.
> > +	 */
> > +	if (ret == BLK_EH_DONT_RESET_TIMER)
> > +		return;
>=20
> This is how completions get lost.

The new approach for handling completions that occur while the .timeout()
callback in progress is as follows:
* blk_mq_complete_request() executes the following code:
       if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED=
F8-OUT, MQ_RQ_COMPLETE))
               return;
* blk_mq_rq_timed_out() executes the following code:
       if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE=
) {
               __blk_mq_complete_request(req);
               return;
       }

As one can see __blk_mq_complete_request() gets called if=
 this race occurs.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:51     ` Bart Van Assche
@ 2018-07-27 16:57       ` Keith Busch
  2018-07-27 16:59         ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 16:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@lst.de, ming.lei@redhat.com, linux-block@vger.kernel.org,
	martin.petersen@oracle.com, axboe@kernel.dk,
	jianchao.w.wang@oracle.com

On Fri, Jul 27, 2018 at 04:51:05PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 10:46 -0600, Keith Busch wrote:
> > On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> > > +	ret = req->q->mq_ops->timeout(req, reserved);
> > > +	/*
> > > +	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
> > > +	 * completed the request or still owns the request and will
> > > +	 * continue processing the timeout asynchronously. In the
> > > +	 * latter case, if blk_mq_complete_request() was called while
> > > +	 * the timeout handler was in progress, ignore that call.
> > > +	 */
> > > +	if (ret == BLK_EH_DONT_RESET_TIMER)
> > > +		return;
> > 
> > This is how completions get lost.
> 
> The new approach for handling completions that occur while the .timeout()
> callback in progress is as follows:
> * blk_mq_complete_request() executes the following code:
>        if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
>                return;
> * blk_mq_rq_timed_out() executes the following code:
>        if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
>                __blk_mq_complete_request(req);
>                return;
>        }
> 
> As one can see __blk_mq_complete_request() gets called if this race occurs.

You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:57       ` Keith Busch
@ 2018-07-27 16:59         ` Bart Van Assche
  2018-07-27 17:04           ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:59 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, jianchao.w.wang@oracle.com,
	linux-block@vger.kernel.org, martin.petersen@oracle.com,
	ming.lei@redhat.com, axboe@kernel.dk

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-7", Size: 1138 bytes --]

On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
+AD4- You skip that code if the driver returns BLK+AF8-EH+AF8-DONT+AF8-RESE=
T+AF8-TIMER.

How about applying the following patch on top of this series?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a97ab5ba9d18..aa66535604fd 100644
--- a/block/blk-mq.c
+-+-+- b/block/blk-mq.c
+AEAAQA- -854,10 +-854,10 +AEAAQA- static void blk+AF8-mq+AF8-rq+AF8-timed+=
AF8-out(struct request +ACo-req, bool reserved)
 	 +ACo- latter case, if blk+AF8-mq+AF8-complete+AF8-request() was called w=
hile
 	 +ACo- the timeout handler was in progress, ignore that call.
 	 +ACo-/
-	if (ret +AD0APQ- BLK+AF8-EH+AF8-DONT+AF8-RESET+AF8-TIMER)
-		return+ADs-
-	WARN+AF8-ON+AF8-ONCE(ret +ACEAPQ- BLK+AF8-EH+AF8-RESET+AF8-TIMER)+ADs-
-	blk+AF8-mq+AF8-add+AF8-timer(req)+ADs-
+-	if (ret +AD0APQ- BLK+AF8-EH+AF8-RESET+AF8-TIMER)
+-		blk+AF8-mq+AF8-add+AF8-timer(req)+ADs-
+-	else
+-		WARN+AF8-ON+AF8-ONCE(ret +ACEAPQ- BLK+AF8-EH+AF8-DONT+AF8-RESET+AF8-TIM=
ER)+ADs-
 again:
 	if (blk+AF8-mq+AF8-change+AF8-rq+AF8-state(req, MQ+AF8-RQ+AF8-TIMED+AF8-O=
UT, MQ+AF8-RQ+AF8-IN+AF8-FLIGHT))
 		return+ADs-

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:59         ` Bart Van Assche
@ 2018-07-27 17:04           ` Keith Busch
  2018-07-27 17:14             ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 17:04 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@lst.de, jianchao.w.wang@oracle.com,
	linux-block@vger.kernel.org, martin.petersen@oracle.com,
	ming.lei@redhat.com, axboe@kernel.dk

On Fri, Jul 27, 2018 at 04:59:34PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> 
> How about applying the following patch on top of this series?

That works for me if you, but it breaks scsi again when
scmd_eh_abort_handler completes the command a second time.

 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a97ab5ba9d18..aa66535604fd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -854,10 +854,10 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	 * latter case, if blk_mq_complete_request() was called while
>  	 * the timeout handler was in progress, ignore that call.
>  	 */
> -	if (ret == BLK_EH_DONT_RESET_TIMER)
> -		return;
> -	WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
> -	blk_mq_add_timer(req);
> +	if (ret == BLK_EH_RESET_TIMER)
> +		blk_mq_add_timer(req);
> +	else
> +		WARN_ON_ONCE(ret != BLK_EH_DONT_RESET_TIMER);
>  again:
>  	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
>  		return;
> 
> Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 17:04           ` Keith Busch
@ 2018-07-27 17:14             ` Bart Van Assche
  2018-07-27 17:58               ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 17:14 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org,
	martin.petersen@oracle.com, jianchao.w.wang@oracle.com,
	ming.lei@redhat.com

On Fri, 2018-07-27 at 11:04 -0600, Keith Busch wrote:
> On Fri, Jul 27, 2018 at 04:59:34PM +0000, Bart Van Assche wrote:
> > On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > > You skip that code if the driver returns BLK_EH_D=
ONT_RESET_TIMER.
> >=20
> > How about applying the following patch on top of this series?
>=20
> That works for me if you, but it breaks scsi again when
> scmd_eh_abort_handler completes the command a second time=
.

How about introducing a new request queue flag that chooses between the
behavior with or without the patch in my previous e-mail? I don't think tha=
t
it is possible to come up with a single implementation that covers the need=
s
of NVMe and SCSI without introducing such a flag. If a SCSI request times o=
ut
then request ownership is transferred from the LLD to the error handler. Fo=
r
the NVMe driver however there is no such transfer of ownership.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 17:14             ` Bart Van Assche
@ 2018-07-27 17:58               ` Keith Busch
  2018-07-27 18:09                 ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 17:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org,
	martin.petersen@oracle.com, jianchao.w.wang@oracle.com,
	ming.lei@redhat.com

On Fri, Jul 27, 2018 at 05:14:18PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 11:04 -0600, Keith Busch wrote:
> > On Fri, Jul 27, 2018 at 04:59:34PM +0000, Bart Van Assche wrote:
> > > On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > > > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> > > 
> > > How about applying the following patch on top of this series?
> > 
> > That works for me if you, but it breaks scsi again when
> > scmd_eh_abort_handler completes the command a second time.
> 
> How about introducing a new request queue flag that chooses between the
> behavior with or without the patch in my previous e-mail? I don't think that
> it is possible to come up with a single implementation that covers the needs
> of NVMe and SCSI without introducing such a flag. If a SCSI request times out
> then request ownership is transferred from the LLD to the error handler. For
> the NVMe driver however there is no such transfer of ownership.

Instead of PATCH 1/5, how about creating a new timeout return code like
"BLK_EH_DONT_COMPLETE"?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 17:58               ` Keith Busch
@ 2018-07-27 18:09                 ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 18:09 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, ming.lei@redhat.com, linux-block@vger.kernel.org,
	martin.petersen@oracle.com, axboe@kernel.dk,
	jianchao.w.wang@oracle.com

On Fri, 2018-07-27 at 11:58 -0600, Keith Busch wrote:
> Instead of PATCH 1/5, how about creating a new timeout return code li=
ke
> "BLK_EH_DONT_COMPLETE"?

That sounds like a good idea to me. I think this approach will avoid that w=
e
have to introduce a request queue flag that chooses between the behavior
needed by the NVMe driver or the behavior needed by the SCSI core.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-07-27 18:09 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
2018-07-27 16:20 ` [PATCH 2/5] block: Remove a superfluous #include directive Bart Van Assche
2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
2018-07-27 16:29   ` Keith Busch
2018-07-27 16:31     ` Bart Van Assche
2018-07-27 16:20 ` [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-07-27 16:46   ` Keith Busch
2018-07-27 16:51     ` Bart Van Assche
2018-07-27 16:57       ` Keith Busch
2018-07-27 16:59         ` Bart Van Assche
2018-07-27 17:04           ` Keith Busch
2018-07-27 17:14             ` Bart Van Assche
2018-07-27 17:58               ` Keith Busch
2018-07-27 18:09                 ` Bart Van Assche
2018-07-27 16:22 ` [PATCH 0/5] Improve blk-mq performance Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox