linux-scsi.vger.kernel.org archive mirror
* [PATCH V3 0/8] block/scsi: safe SCSI quiescing
@ 2017-09-02 13:08 Ming Lei
  2017-09-02 13:08 ` [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue Ming Lei
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

Hi,

The current SCSI quiesce implementation isn't safe: it can easily trigger I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for RQF_PREEMPT
can be dispatched to SCSI successfully, and scsi_device_quiesce() simply
waits for completion of the I/Os already dispatched to the SCSI stack. That
isn't enough at all.

New requests can still be allocated, but none of the allocated
requests can be dispatched successfully, so the request pool can be
exhausted easily.

Then a request with RQF_PREEMPT can't be allocated, and the system may
hang forever, for example during system suspend or SCSI domain validation.

I/O hangs during both system suspend[1] and SCSI domain validation
have been reported before.
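
A sketch of how the hang develops (legacy path shown; blk-mq runs out
of requests the same way; q is sdev->request_queue):

        scsi_device_quiesce(sdev);      /* device is now in SDEV_QUIESCE */

        /* normal submitters keep allocating requests; nothing can be
         * dispatched, so the request pool drains to empty */

        rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_NOIO);
        /* the RQF_PREEMPT user (e.g. the suspend path) now blocks
         * here forever, waiting for a free request */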

This patchset introduces preempt freeze and tries to solve the issue
by preempt-freezing the block queue during SCSI quiesce, while still
allowing RQF_PREEMPT requests to be allocated when the queue is
preempt-frozen.

Both SCSI and SCSI_MQ have this I/O deadlock issue. This patchset fixes
both by introducing blk_freeze_queue_preempt() and
blk_unfreeze_queue_preempt(), and also unifies the current interfaces
for freezing the queue between the legacy block layer and blk-mq.
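
In sketch form, the pairing this series enables (helper names as
introduced by the patches below; not the exact driver code):

        blk_freeze_queue_preempt(q);    /* drain normal I/O, freeze queue */

        /* only RQF_PREEMPT allocation may pass from now on */
        rq = __blk_get_request(q, REQ_OP_SCSI_IN, GFP_NOIO, BLK_REQ_PREEMPT);
        /* ... issue the request, e.g. sync_cache/start_stop ... */

        blk_unfreeze_queue_preempt(q);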

Oleksandr has verified that V2 of this patchset fixes his I/O hang
during suspend/resume cycles.

V3:
	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
	- call blk_queue_enter_live() only when queue is preempt frozen
	- cleanup a bit on the implementation of preempt freeze
	- only patch 6 and 7 are changed

V2:
	- drop the 1st patch in V1 because percpu_ref_is_dying() is
	enough as pointed by Tejun
	- introduce preempt version of blk_[freeze|unfreeze]_queue
	- sync between preempt freeze and normal freeze
	- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2



Ming Lei (8):
  blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue
  blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue
  blk-mq: only run hw queues for blk-mq
  blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  block: tracking request allocation with q_usage_counter
  block: introduce preempt version of blk_[freeze|unfreeze]_queue
  block: allow to allocate req with REQF_PREEMPT when queue is preempt
    frozen
  SCSI: preempt freeze block queue when SCSI device is put into quiesce

 block/bfq-iosched.c      |   2 +-
 block/blk-cgroup.c       |   8 +--
 block/blk-core.c         |  53 ++++++++++++---
 block/blk-mq.c           | 170 +++++++++++++++++++++++++++++++++++++++--------
 block/blk-mq.h           |   1 -
 block/blk.h              |  17 +++++
 block/elevator.c         |   4 +-
 drivers/block/loop.c     |  16 ++---
 drivers/block/rbd.c      |   2 +-
 drivers/nvme/host/core.c |   8 +--
 drivers/scsi/scsi_lib.c  |  22 +++++-
 include/linux/blk-mq.h   |  15 +++--
 include/linux/blkdev.h   |  21 +++++-
 13 files changed, 273 insertions(+), 66 deletions(-)

-- 
2.9.5


* [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 13:08 ` [PATCH V3 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue Ming Lei
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

We will support freezing the queue on the legacy block path too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-cgroup.c       |  4 ++--
 block/blk-mq.c           | 10 +++++-----
 block/elevator.c         |  2 +-
 drivers/block/loop.c     |  8 ++++----
 drivers/nvme/host/core.c |  4 ++--
 include/linux/blk-mq.h   |  2 +-
 6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0480892e97e5..02e8a47ac77c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1337,7 +1337,7 @@ int blkcg_activate_policy(struct request_queue *q,
 	spin_unlock_irq(q->queue_lock);
 out_bypass_end:
 	if (q->mq_ops)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	else
 		blk_queue_bypass_end(q);
 	if (pd_prealloc)
@@ -1388,7 +1388,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 	spin_unlock_irq(q->queue_lock);
 
 	if (q->mq_ops)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	else
 		blk_queue_bypass_end(q);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d935f15c54da..82136e83951d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -172,7 +172,7 @@ void blk_mq_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
-void blk_mq_unfreeze_queue(struct request_queue *q)
+void blk_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
@@ -183,7 +183,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
-EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
@@ -2250,7 +2250,7 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	}
 }
 
@@ -2708,7 +2708,7 @@ static int __blk_mq_update_nr_requests(struct request_queue *q,
 	if (!ret)
 		q->nr_requests = nr;
 
-	blk_mq_unfreeze_queue(q);
+	blk_unfreeze_queue(q);
 
 	return ret;
 }
@@ -2757,7 +2757,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	}
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 }
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
diff --git a/block/elevator.c b/block/elevator.c
index 0e465809d3f3..371c8165c9e8 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -994,7 +994,7 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
-	blk_mq_unfreeze_queue(q);
+	blk_unfreeze_queue(q);
 	return ret;
 }
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2fbd4089c20e..5c11ea44d470 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -217,7 +217,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
 	else
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 }
 
 static int
@@ -605,7 +605,7 @@ static int loop_switch(struct loop_device *lo, struct file *file)
 	do_loop_switch(lo, &w);
 
 	/* unfreeze */
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	return 0;
 }
@@ -1079,7 +1079,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_unbound;
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
 		loop_reread_partitions(lo, bdev);
@@ -1191,7 +1191,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	__loop_update_dio(lo, lo->use_dio);
 
  exit:
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
 	     !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 37046ac2c441..5c76b0a96be2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1226,7 +1226,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 
 	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
-	blk_mq_unfreeze_queue(disk->queue);
+	blk_unfreeze_queue(disk->queue);
 }
 
 static int nvme_revalidate_disk(struct gendisk *disk)
@@ -2753,7 +2753,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unfreeze_queue(ns->queue);
+		blk_unfreeze_queue(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 13f6c25fa461..2572e5641568 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -257,7 +257,7 @@ void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
-void blk_mq_unfreeze_queue(struct request_queue *q);
+void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
-- 
2.9.5


* [PATCH V3 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
  2017-09-02 13:08 ` [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 13:08 ` [PATCH V3 3/8] blk-mq: only run hw queues for blk-mq Ming Lei
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

This API will be used by the legacy path too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c      |  2 +-
 block/blk-cgroup.c       |  4 ++--
 block/blk-mq.c           | 17 ++++-------------
 block/blk-mq.h           |  1 -
 block/elevator.c         |  2 +-
 drivers/block/loop.c     |  8 ++++----
 drivers/block/rbd.c      |  2 +-
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 9 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 509f39998011..ce2b00e897e2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4757,7 +4757,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	 * The invocation of the next bfq_create_group_hierarchy
 	 * function is the head of a chain of function calls
 	 * (bfq_create_group_hierarchy->blkcg_activate_policy->
-	 * blk_mq_freeze_queue) that may lead to the invocation of the
+	 * blk_freeze_queue) that may lead to the invocation of the
 	 * has_work hook function. For this reason,
 	 * bfq_create_group_hierarchy is invoked only after all
 	 * scheduler data has been initialized, apart from the fields
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02e8a47ac77c..87c15f3947d5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1296,7 +1296,7 @@ int blkcg_activate_policy(struct request_queue *q,
 		return 0;
 
 	if (q->mq_ops)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 	else
 		blk_queue_bypass_start(q);
 pd_prealloc:
@@ -1363,7 +1363,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 		return;
 
 	if (q->mq_ops)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 	else
 		blk_queue_bypass_start(q);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 82136e83951d..8cf1f7cbef2b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -161,16 +161,7 @@ void blk_freeze_queue(struct request_queue *q)
 	blk_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
-
-void blk_mq_freeze_queue(struct request_queue *q)
-{
-	/*
-	 * ...just an alias to keep freeze and unfreeze actions balanced
-	 * in the blk_mq_* namespace
-	 */
-	blk_freeze_queue(q);
-}
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
+EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
 void blk_unfreeze_queue(struct request_queue *q)
 {
@@ -2248,7 +2239,7 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
 	lockdep_assert_held(&set->tag_list_lock);
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
 		blk_unfreeze_queue(q);
 	}
@@ -2683,7 +2674,7 @@ static int __blk_mq_update_nr_requests(struct request_queue *q,
 	if (!set)
 		return -EINVAL;
 
-	blk_mq_freeze_queue(q);
+	blk_freeze_queue(q);
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -2747,7 +2738,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
 	blk_mq_update_queue_map(set);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1b9742eb7399..7ce29ef1e6f3 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,6 @@ struct blk_mq_ctx {
 } ____cacheline_aligned_in_smp;
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 371c8165c9e8..1164c8a3720f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,7 +967,7 @@ static int elevator_switch_mq(struct request_queue *q,
 {
 	int ret;
 
-	blk_mq_freeze_queue(q);
+	blk_freeze_queue(q);
 
 	if (q->elevator) {
 		if (q->elevator->registered)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c11ea44d470..b2e708b7e1e6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -211,7 +211,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup
 	 * will get updated by ioctl(LOOP_GET_STATUS)
 	 */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
 	if (use_dio)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
@@ -599,7 +599,7 @@ static int loop_switch(struct loop_device *lo, struct file *file)
 	w.file = file;
 
 	/* freeze queue and wait for completion of scheduled requests */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	/* do the switch action */
 	do_loop_switch(lo, &w);
@@ -1046,7 +1046,7 @@ static int loop_clr_fd(struct loop_device *lo)
 		return -EINVAL;
 
 	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	spin_lock_irq(&lo->lo_lock);
 	lo->lo_state = Lo_rundown;
@@ -1116,7 +1116,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 		return -EINVAL;
 
 	/* I/O need to be drained during transfer transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	err = loop_release_xfer(lo);
 	if (err)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b008b6a98098..3a97ffcb3a81 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6347,7 +6347,7 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
 		 * Prevent new IO from being queued and wait for existing
 		 * IO to complete/fail.
 		 */
-		blk_mq_freeze_queue(rbd_dev->disk->queue);
+		blk_freeze_queue(rbd_dev->disk->queue);
 		blk_set_queue_dying(rbd_dev->disk->queue);
 	}
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5c76b0a96be2..986f2b4f9760 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1210,7 +1210,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	bs = 1 << ns->lba_shift;
 	ns->noiob = le16_to_cpu(id->noiob);
 
-	blk_mq_freeze_queue(disk->queue);
+	blk_freeze_queue(disk->queue);
 
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2572e5641568..8ae77e088c01 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -256,7 +256,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
-void blk_mq_freeze_queue(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
-- 
2.9.5


* [PATCH V3 3/8] blk-mq: only run hw queues for blk-mq
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
  2017-09-02 13:08 ` [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue Ming Lei
  2017-09-02 13:08 ` [PATCH V3 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 13:08 ` [PATCH V3 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

Hardware queues only exist for blk-mq, so they should only be run when
q->mq_ops is set. This patch just makes that explicit.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8cf1f7cbef2b..4c532d8612e1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		blk_mq_run_hw_queues(q, false);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5


* [PATCH V3 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
                   ` (2 preceding siblings ...)
  2017-09-02 13:08 ` [PATCH V3 3/8] blk-mq: only run hw queues for blk-mq Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 13:08 ` [PATCH V3 5/8] block: tracking request allocation with q_usage_counter Ming Lei
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

The only change on the legacy path is that blk_drain_queue() is now run
from blk_freeze_queue(), which is called in blk_cleanup_queue().

So this patch removes the explicit __blk_drain_queue() call from
blk_cleanup_queue().
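
With this in place the freeze sequence is the same for both paths; in
sketch form:

        blk_freeze_queue_start(q);      /* kills q_usage_counter */
        blk_freeze_queue_wait(q);       /* drains the legacy queue first
                                         * (if !q->mq_ops), then waits for
                                         * the counter to drop to zero */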

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c         | 17 +++++++++++++++--
 block/blk-mq.c           |  8 +++++---
 block/blk.h              |  1 +
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d579501f24ba..ce2d3b6f6c62 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -530,6 +530,21 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
 }
 
 /**
+ * blk_drain_queue - drain requests from request_queue
+ * @q: queue to drain
+ *
+ * Drain requests from @q.  All pending requests are drained.
+ * The caller is responsible for ensuring that no new requests
+ * which need to be drained are queued.
+ */
+void blk_drain_queue(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	__blk_drain_queue(q, true);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/**
  * blk_queue_bypass_start - enter queue bypass mode
  * @q: queue of interest
  *
@@ -653,8 +668,6 @@ void blk_cleanup_queue(struct request_queue *q)
 	 */
 	blk_freeze_queue(q);
 	spin_lock_irq(lock);
-	if (!q->mq_ops)
-		__blk_drain_queue(q, true);
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c532d8612e1..24de78afbe9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,11 +131,13 @@ void blk_freeze_queue_start(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
-void blk_mq_freeze_queue_wait(struct request_queue *q)
+void blk_freeze_queue_wait(struct request_queue *q)
 {
+	if (!q->mq_ops)
+		blk_drain_queue(q);
 	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
@@ -160,7 +162,7 @@ void blk_freeze_queue(struct request_queue *q)
 	 * exported to drivers as the only user for unfreeze is blk_mq.
 	 */
 	blk_freeze_queue_start(q);
-	blk_mq_freeze_queue_wait(q);
+	blk_freeze_queue_wait(q);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
diff --git a/block/blk.h b/block/blk.h
index 6847c5435cca..242486e26a81 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,6 +64,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
+void blk_drain_queue(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 986f2b4f9760..d34a9ffaa940 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2778,7 +2778,7 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_freeze_queue_wait(ns->queue);
+		blk_freeze_queue_wait(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8ae77e088c01..f90d78eb85df 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -259,7 +259,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
-void blk_mq_freeze_queue_wait(struct request_queue *q);
+void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V3 5/8] block: tracking request allocation with q_usage_counter
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
                   ` (3 preceding siblings ...)
  2017-09-02 13:08 ` [PATCH V3 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 13:08 ` [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

This usage is basically the same as in blk-mq, so that freezing the
queue can easily be supported on the legacy path too.
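
On the legacy path the per-request lifetime now looks like this, in
sketch form:

        ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
        if (ret)
                return ERR_PTR(ret);
        rq = get_request(q, op, NULL, gfp_mask); /* holds the usage ref */
        /* ... */
        __blk_put_request(q, rq);       /* drops it via blk_queue_exit() */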

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ce2d3b6f6c62..85b15833a7a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1405,16 +1405,21 @@ static struct request *blk_old_get_request(struct request_queue *q,
 					   unsigned int op, gfp_t gfp_mask)
 {
 	struct request *rq;
+	int ret = 0;
 
 	WARN_ON_ONCE(q->mq_ops);
 
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
+	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+	if (ret)
+		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
+		blk_queue_exit(q);
 		return rq;
 	}
 
@@ -1586,6 +1591,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		blk_free_request(rl, req);
 		freed_request(rl, sync, rq_flags);
 		blk_put_rl(rl);
+		blk_queue_exit(q);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1867,8 +1873,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
+	blk_queue_enter_live(q);
 	req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		blk_queue_exit(q);
 		__wbt_done(q->rq_wb, wb_acct);
 		if (PTR_ERR(req) == -ENOMEM)
 			bio->bi_status = BLK_STS_RESOURCE;
-- 
2.9.5


* [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
                   ` (4 preceding siblings ...)
  2017-09-02 13:08 ` [PATCH V3 5/8] block: tracking request allocation with q_usage_counter Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-04 15:21   ` Bart Van Assche
  2017-09-02 13:08 ` [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen Ming Lei
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

These two APIs are required to allow allocation of RQF_PREEMPT
requests while the queue is preempt-frozen.

The following two points have to be guaranteed for one queue:

1) preempt freezing can be started only after all in-progress
normal & preempt freezings are completed

2) normal freezing can be started only if no preempt freezing
is in progress

This is because, for normal freezing, once blk_mq_freeze_queue_wait()
returns we have to make sure no request is entering the queue any
more.

An rwsem would have been perfect for this kind of synchronization, but
we need to support nested normal freezes, so a spinlock plus the
normal_freezing and preempt_freezing flags are used for the sync
between normal freeze and preempt freeze.
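
Nested normal freezes still work, because only the freeze depth changes
on inner calls; in sketch form:

        blk_freeze_queue(q);    /* depth 0 -> 1: kills q_usage_counter */
        blk_freeze_queue(q);    /* depth 1 -> 2: nothing else to do */
        blk_unfreeze_queue(q);  /* depth 2 -> 1 */
        blk_unfreeze_queue(q);  /* depth 1 -> 0: reinit counter, wake up */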

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
 block/blk.h            |  16 +++++++
 include/linux/blk-mq.h |   2 +
 include/linux/blkdev.h |   4 ++
 5 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85b15833a7a5..2549b0a0535d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -899,6 +899,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (blkcg_init_queue(q))
 		goto fail_ref;
 
+	spin_lock_init(&q->freeze_lock);
+
 	return q;
 
 fail_ref:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 24de78afbe9a..54b8d8b9f40e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -118,16 +118,75 @@ void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
 }
 
-void blk_freeze_queue_start(struct request_queue *q)
+static bool queue_freeze_is_over(struct request_queue *q, bool preempt)
+{
+	/*
+	 * For preempt freeze, we simply call blk_queue_enter_live()
+	 * before allocating one request of RQF_PREEMPT, so we have
+	 * to check if queue is dead, otherwise we may hang on dead
+	 * queue.
+	 *
+	 * For normal freeze, no need to check blk_queue_dying()
+	 * because it is checked in blk_queue_enter().
+	 */
+	if (preempt)
+		return !(q->normal_freezing + q->preempt_freezing) ||
+			blk_queue_dying(q);
+	return !q->preempt_freezing;
+}
+
+static bool __blk_freeze_queue_start(struct request_queue *q, bool preempt)
 {
 	int freeze_depth;
+	bool start_freeze = true;
+
+	/*
+	 * Wait for completion of another kind of freezing.
+	 *
+	 * We have to sync between normal freeze and preempt
+	 * freeze. preempt freeze can only be started iff all
+	 * pending normal & preempt freezing are completed,
+	 * meantime normal freeze can be started only if there
+	 * isn't pending preempt freezing.
+	 *
+	 * rwsem should have been perfect for this kind of sync,
+	 * but we need to support nested normal freeze, so use
+	 * spin_lock with two flags for syncing between normal
+	 * freeze and preempt freeze.
+	 */
+	spin_lock(&q->freeze_lock);
+	wait_event_cmd(q->mq_freeze_wq,
+		       queue_freeze_is_over(q, preempt),
+		       spin_unlock(&q->freeze_lock),
+		       spin_lock(&q->freeze_lock));
+
+	if (preempt && blk_queue_dying(q)) {
+		start_freeze = false;
+		goto unlock;
+	}
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
+		if (preempt) {
+			q->preempt_freezing = 1;
+			q->preempt_unfreezing = 0;
+		} else
+			q->normal_freezing = 1;
+		spin_unlock(&q->freeze_lock);
+
 		percpu_ref_kill(&q->q_usage_counter);
 		if (q->mq_ops)
 			blk_mq_run_hw_queues(q, false);
-	}
+	} else
+ unlock:
+		spin_unlock(&q->freeze_lock);
+
+	return start_freeze;
+}
+
+void blk_freeze_queue_start(struct request_queue *q)
+{
+	__blk_freeze_queue_start(q, false);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
@@ -166,7 +225,7 @@ void blk_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_unfreeze_queue(struct request_queue *q)
+static void __blk_unfreeze_queue(struct request_queue *q, bool preempt)
 {
 	int freeze_depth;
 
@@ -174,12 +233,67 @@ void blk_unfreeze_queue(struct request_queue *q)
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
+
+		/*
+		 * clearing the freeze flag so that any pending
+		 * freeze can move on
+		 */
+		spin_lock(&q->freeze_lock);
+		if (preempt)
+			q->preempt_freezing = 0;
+		else
+			q->normal_freezing = 0;
+		spin_unlock(&q->freeze_lock);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
+
+void blk_unfreeze_queue(struct request_queue *q)
+{
+	__blk_unfreeze_queue(q, false);
+}
 EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
+ * Once this function is returned, only allow to get request
+ * of RQF_PREEMPT.
+ */
+void blk_freeze_queue_preempt(struct request_queue *q)
+{
+	/*
+	 * If queue isn't in preempt_frozen, the queue has
+	 * to be dying, so do nothing since no I/O can
+	 * succeed any more.
+	 */
+	if (__blk_freeze_queue_start(q, true))
+		blk_freeze_queue_wait(q);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_preempt);
+
+/*
+ * It is the caller's responsibility to make sure no new
+ * request is allocated before calling this function.
+ */
+void blk_unfreeze_queue_preempt(struct request_queue *q)
+{
+	/*
+	 * If queue isn't in preempt_frozen, the queue should
+	 * be dying, so do nothing since no I/O can succeed.
+	 */
+	if (blk_queue_is_preempt_frozen(q)) {
+
+		/* no new request can be coming after unfreezing */
+		spin_lock(&q->freeze_lock);
+		q->preempt_unfreezing = 1;
+		spin_unlock(&q->freeze_lock);
+
+		blk_freeze_queue_wait(q);
+		__blk_unfreeze_queue(q, true);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue_preempt);
+
+/*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
  */
diff --git a/block/blk.h b/block/blk.h
index 242486e26a81..28e9be6a14c6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -80,6 +80,22 @@ static inline void blk_queue_enter_live(struct request_queue *q)
 	percpu_ref_get(&q->q_usage_counter);
 }
 
+static inline bool blk_queue_is_preempt_frozen(struct request_queue *q)
+{
+	bool preempt_frozen;
+	bool preempt_unfreezing;
+
+	if (!percpu_ref_is_dying(&q->q_usage_counter))
+		return false;
+
+	spin_lock(&q->freeze_lock);
+	preempt_frozen = q->preempt_freezing;
+	preempt_unfreezing = q->preempt_unfreezing;
+	spin_unlock(&q->freeze_lock);
+
+	return preempt_frozen && !preempt_unfreezing;
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f90d78eb85df..5ae8c82d6273 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -258,6 +258,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
+void blk_freeze_queue_preempt(struct request_queue *q);
+void blk_unfreeze_queue_preempt(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f45f157b2910..5618d174100a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -565,6 +565,10 @@ struct request_queue {
 
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
+	spinlock_t		freeze_lock;
+	unsigned		normal_freezing:1;
+	unsigned		preempt_freezing:1;
+	unsigned		preempt_unfreezing:1;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
-- 
2.9.5


* [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
                   ` (5 preceding siblings ...)
  2017-09-02 13:08 ` [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 13:12   ` Ming Lei
  2017-09-02 13:08 ` [PATCH V3 8/8] SCSI: preempt freeze block queue when SCSI device is put into quiesce Ming Lei
  2017-09-02 14:47 ` [PATCH V3 0/8] block/scsi: safe SCSI quiescing Oleksandr Natalenko
  8 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

RQF_PREEMPT is a bit special because such a request must be
dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() to allow the block layer
to allocate a request while the queue is preempt-frozen, since the
following patch will preempt-freeze the queue before quiescing the
SCSI device, for supporting safe SCSI quiescing.
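
Existing blk_get_request() callers are unchanged; only callers that
must pass a preempt-frozen queue use the new flag. In sketch form:

        /* ordinary allocation: blocks/fails while the queue is frozen */
        rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_NOIO);

        /* RQF_PREEMPT allocation: succeeds during a preempt freeze */
        rq = __blk_get_request(q, REQ_OP_SCSI_IN, GFP_NOIO, BLK_REQ_PREEMPT);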

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 28 ++++++++++++++++++++--------
 block/blk-mq.c         | 14 ++++++++++++--
 include/linux/blk-mq.h |  7 ++++---
 include/linux/blkdev.h | 17 +++++++++++++++--
 4 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2549b0a0535d..f7a6fbb87dea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1404,7 +1404,8 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-					   unsigned int op, gfp_t gfp_mask)
+					   unsigned int op, gfp_t gfp_mask,
+					   unsigned int flags)
 {
 	struct request *rq;
 	int ret = 0;
@@ -1414,9 +1415,20 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
-	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+	/*
+	 * We need to allocate req of REQF_PREEMPT in preempt freezing.
+	 * No normal freezing can be started when preempt freezing
+	 * is in-progress, and queue dying is checked before starting
+	 * preempt freezing, so it is safe to use blk_queue_enter_live()
+	 * in case of preempt freezing.
+	 */
+	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
+		blk_queue_enter_live(q);
+	else
+		ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
 	if (ret)
 		return ERR_PTR(ret);
+
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
@@ -1432,26 +1444,26 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-				gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+				  gfp_t gfp_mask, unsigned int flags)
 {
 	struct request *req;
 
 	if (q->mq_ops) {
 		req = blk_mq_alloc_request(q, op,
-			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-				0 : BLK_MQ_REQ_NOWAIT);
+			flags | ((gfp_mask & __GFP_DIRECT_RECLAIM) ?
+				0 : BLK_MQ_REQ_NOWAIT));
 		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
 			q->mq_ops->initialize_rq_fn(req);
 	} else {
-		req = blk_old_get_request(q, op, gfp_mask);
+		req = blk_old_get_request(q, op, gfp_mask, flags);
 		if (!IS_ERR(req) && q->initialize_rq_fn)
 			q->initialize_rq_fn(req);
 	}
 
 	return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 54b8d8b9f40e..e81001d1da27 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -496,9 +496,19 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 {
 	struct blk_mq_alloc_data alloc_data = { .flags = flags };
 	struct request *rq;
-	int ret;
+	int ret = 0;
 
-	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+	/*
+	 * We need to allocate req of REQF_PREEMPT in preempt freezing.
+	 * No normal freezing can be started when preempt freezing
+	 * is in-progress, and queue dying is checked before starting
+	 * preempt freezing, so it is safe to use blk_queue_enter_live()
+	 * in case of preempt freezing.
+	 */
+	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
+		blk_queue_enter_live(q);
+	else
+		ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5ae8c82d6273..596f433eb54c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -200,9 +200,10 @@ void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
-	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
-	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
+	BLK_MQ_REQ_PREEMPT	= BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */
+	BLK_MQ_REQ_NOWAIT	= (1 << 8), /* return when out of requests */
+	BLK_MQ_REQ_RESERVED	= (1 << 9), /* allocate from reserved pool */
+	BLK_MQ_REQ_INTERNAL	= (1 << 10), /* allocate internal/sched tag */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5618d174100a..ff371c42eb3f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,6 +862,11 @@ enum {
 	BLKPREP_INVALID,	/* invalid command, kill, return -EREMOTEIO */
 };
 
+/* passed to __blk_get_request */
+enum {
+	BLK_REQ_PREEMPT	= (1 << 0), /* allocate for RQF_PREEMPT */
+};
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 /*
@@ -944,8 +949,9 @@ extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
 extern void __blk_put_request(struct request_queue *, struct request *);
-extern struct request *blk_get_request(struct request_queue *, unsigned int op,
-				       gfp_t gfp_mask);
+extern struct request *__blk_get_request(struct request_queue *,
+					 unsigned int op, gfp_t gfp_mask,
+					 unsigned int flags);
 extern void blk_requeue_request(struct request_queue *, struct request *);
 extern int blk_lld_busy(struct request_queue *q);
 extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
@@ -996,6 +1002,13 @@ blk_status_t errno_to_blk_status(int errno);
 
 bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
 
+static inline struct request *blk_get_request(struct request_queue *q,
+					      unsigned int op,
+					      gfp_t gfp_mask)
+{
+	return __blk_get_request(q, op, gfp_mask, 0);
+}
+
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
 	return bdev->bd_disk->queue;	/* this is never NULL */
-- 
2.9.5


* [PATCH V3 8/8] SCSI: preempt freeze block queue when SCSI device is put into quiesce
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
                   ` (6 preceding siblings ...)
  2017-09-02 13:08 ` [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen Ming Lei
@ 2017-09-02 13:08 ` Ming Lei
  2017-09-02 14:47 ` [PATCH V3 0/8] block/scsi: safe SCSI quiescing Oleksandr Natalenko
  8 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

Simply quiescing the SCSI device and waiting for completion of the I/O
already dispatched to the SCSI queue isn't safe: it is easy to use up
requests, because none of the newly allocated requests can be
dispatched while the device is in QUIESCE. Then no request can be
allocated for RQF_PREEMPT, and the system may hang somewhere, such as
when sending sync_cache or start_stop commands in the system suspend
path.

Before quiescing SCSI, this patch first freezes the block queue in
preempt mode, so no new normal request can enter the queue any more,
and all pending requests are drained once blk_freeze_queue_preempt()
returns. Only RQF_PREEMPT requests can be allocated during a preempt
freeze.

This patch also uses __blk_get_request() for allocating the request
with RQF_PREEMPT, so that the allocation can succeed even though the
block queue is preempt-frozen.
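
The resulting order on the quiesce/resume path, in sketch form:

        blk_freeze_queue_preempt(sdev->request_queue);  /* drain normal I/O */
        scsi_device_set_state(sdev, SDEV_QUIESCE);
        /* ... sync_cache/start_stop sent via
         * __blk_get_request(..., BLK_REQ_PREEMPT) ... */
        scsi_device_resume(sdev);  /* ends with blk_unfreeze_queue_preempt() */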

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..e1ad135cb209 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -243,10 +243,12 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct request *req;
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
+	unsigned flag = sdev->sdev_state == SDEV_QUIESCE ? BLK_REQ_PREEMPT : 0;
 
-	req = blk_get_request(sdev->request_queue,
+	req = __blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+			flag);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -2890,6 +2892,20 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
 	int err;
 
+	/*
+	 * Simply quiescing the SCSI device isn't safe: it is easy
+	 * to use up requests, because none of the allocated requests
+	 * can be dispatched while the device is in QUIESCE.
+	 * Then no request can be allocated and we may hang
+	 * somewhere, such as during system suspend/resume.
+	 *
+	 * So freeze the block queue in preempt mode first: no new
+	 * normal request can enter the queue any more, and all pending
+	 * requests are drained once blk_freeze_queue_preempt() returns.
+	 * Only RQF_PREEMPT allocation is allowed in preempt freeze.
+	 */
+	blk_freeze_queue_preempt(sdev->request_queue);
+
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
 	mutex_unlock(&sdev->state_mutex);
@@ -2926,6 +2942,8 @@ void scsi_device_resume(struct scsi_device *sdev)
 	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
 		scsi_run_queue(sdev->request_queue);
 	mutex_unlock(&sdev->state_mutex);
+
+	blk_unfreeze_queue_preempt(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5


* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-02 13:08 ` [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen Ming Lei
@ 2017-09-02 13:12   ` Ming Lei
  2017-09-04  4:13     ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2017-09-02 13:12 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo

On Sat, Sep 02, 2017 at 09:08:39PM +0800, Ming Lei wrote:
> REQF_PREEMPT is a bit special because the request is required
> to be dispatched to lld even when SCSI device is quiesced.
> 
> So this patch introduces __blk_get_request() to allow block
> layer to allocate request when queue is preempt frozen, since we
> will preempt freeze queue before quiescing SCSI device in the
> following patch for supporting safe SCSI quiescing.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c       | 28 ++++++++++++++++++++--------
>  block/blk-mq.c         | 14 ++++++++++++--
>  include/linux/blk-mq.h |  7 ++++---
>  include/linux/blkdev.h | 17 +++++++++++++++--
>  4 files changed, 51 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2549b0a0535d..f7a6fbb87dea 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1404,7 +1404,8 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
>  }
>  
>  static struct request *blk_old_get_request(struct request_queue *q,
> -					   unsigned int op, gfp_t gfp_mask)
> +					   unsigned int op, gfp_t gfp_mask,
> +					   unsigned int flags)
>  {
>  	struct request *rq;
>  	int ret = 0;
> @@ -1414,9 +1415,20 @@ static struct request *blk_old_get_request(struct request_queue *q,
>  	/* create ioc upfront */
>  	create_io_context(gfp_mask, q->node);
>  
> -	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> +	/*
> +	 * We need to allocate req of REQF_PREEMPT in preempt freezing.
> +	 * No normal freezing can be started when preempt freezing
> +	 * is in-progress, and queue dying is checked before starting
> +	 * preempt freezing, so it is safe to use blk_queue_enter_live()
> +	 * in case of preempt freezing.
> +	 */
> +	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
> +		blk_queue_enter_live(q);
> +	else
> +		ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
>  	if (ret)
>  		return ERR_PTR(ret);
> +
>  	spin_lock_irq(q->queue_lock);
>  	rq = get_request(q, op, NULL, gfp_mask);
>  	if (IS_ERR(rq)) {
> @@ -1432,26 +1444,26 @@ static struct request *blk_old_get_request(struct request_queue *q,
>  	return rq;
>  }
>  
> -struct request *blk_get_request(struct request_queue *q, unsigned int op,
> -				gfp_t gfp_mask)
> +struct request *__blk_get_request(struct request_queue *q, unsigned int op,
> +				  gfp_t gfp_mask, unsigned int flags)
>  {
>  	struct request *req;
>  
>  	if (q->mq_ops) {
>  		req = blk_mq_alloc_request(q, op,
> -			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
> -				0 : BLK_MQ_REQ_NOWAIT);
> +			flags | ((gfp_mask & __GFP_DIRECT_RECLAIM) ?
> +				0 : BLK_MQ_REQ_NOWAIT));
>  		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
>  			q->mq_ops->initialize_rq_fn(req);
>  	} else {
> -		req = blk_old_get_request(q, op, gfp_mask);
> +		req = blk_old_get_request(q, op, gfp_mask, flags);
>  		if (!IS_ERR(req) && q->initialize_rq_fn)
>  			q->initialize_rq_fn(req);
>  	}
>  
>  	return req;
>  }
> -EXPORT_SYMBOL(blk_get_request);
> +EXPORT_SYMBOL(__blk_get_request);
>  
>  /**
>   * blk_requeue_request - put a request back on queue
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 54b8d8b9f40e..e81001d1da27 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -496,9 +496,19 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>  {
>  	struct blk_mq_alloc_data alloc_data = { .flags = flags };
>  	struct request *rq;
> -	int ret;
> +	int ret = 0;
>  
> -	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
> +	/*
> +	 * We need to allocate req of REQF_PREEMPT in preempt freezing.
> +	 * No normal freezing can be started when preempt freezing
> +	 * is in-progress, and queue dying is checked before starting
> +	 * preempt freezing, so it is safe to use blk_queue_enter_live()
> +	 * in case of preempt freezing.
> +	 */
> +	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
> +		blk_queue_enter_live(q);
> +	else
> +		ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
>  	if (ret)
>  		return ERR_PTR(ret);
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 5ae8c82d6273..596f433eb54c 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -200,9 +200,10 @@ void blk_mq_free_request(struct request *rq);
>  bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
>  
>  enum {
> -	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
> -	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
> -	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
> +	BLK_MQ_REQ_PREEMPT	= BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */
> +	BLK_MQ_REQ_NOWAIT	= (1 << 8), /* return when out of requests */
> +	BLK_MQ_REQ_RESERVED	= (1 << 9), /* allocate from reserved pool */
> +	BLK_MQ_REQ_INTERNAL	= (1 << 10), /* allocate internal/sched tag */
>  };
>  
>  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5618d174100a..ff371c42eb3f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -862,6 +862,11 @@ enum {
>  	BLKPREP_INVALID,	/* invalid command, kill, return -EREMOTEIO */
>  };
>  
> +/* passed to __blk_get_request */
> +enum {
> +	BLK_REQ_PREEMPT	= (1 << 0), /* allocate for RQF_PREEMPT */
> +};
> +
>  extern unsigned long blk_max_low_pfn, blk_max_pfn;
>  
>  /*
> @@ -944,8 +949,9 @@ extern void blk_rq_init(struct request_queue *q, struct request *rq);
>  extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
>  extern void blk_put_request(struct request *);
>  extern void __blk_put_request(struct request_queue *, struct request *);
> -extern struct request *blk_get_request(struct request_queue *, unsigned int op,
> -				       gfp_t gfp_mask);
> +extern struct request *__blk_get_request(struct request_queue *,
> +					 unsigned int op, gfp_t gfp_mask,
> +					 unsigned int flags);
>  extern void blk_requeue_request(struct request_queue *, struct request *);
>  extern int blk_lld_busy(struct request_queue *q);
>  extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
> @@ -996,6 +1002,13 @@ blk_status_t errno_to_blk_status(int errno);
>  
>  bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
>  
> +static inline struct request *blk_get_request(struct request_queue *q,
> +					      unsigned int op,
> +					      gfp_t gfp_mask)
> +{
> +	return __blk_get_request(q, op, gfp_mask, 0);
> +}
> +
>  static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>  {
>  	return bdev->bd_disk->queue;	/* this is never NULL */
> -- 
> 2.9.5
> 

Hi Bart,

Please let us know if V3 addresses your previous concern about calling
blk_queue_enter_live() during preempt freezing.


Thanks,
Ming


* Re: [PATCH V3 0/8] block/scsi: safe SCSI quiescing
  2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
                   ` (7 preceding siblings ...)
  2017-09-02 13:08 ` [PATCH V3 8/8] SCSI: preempt freeze block queue when SCSI device is put into quiesce Ming Lei
@ 2017-09-02 14:47 ` Oleksandr Natalenko
  8 siblings, 0 replies; 24+ messages in thread
From: Oleksandr Natalenko @ 2017-09-02 14:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley,
	Johannes Thumshirn, Tejun Heo

Again,

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

On Saturday 2 September 2017 15:08:32 CEST Ming Lei wrote:
> Hi,
> 
> The current SCSI quiesce isn't safe and easy to trigger I/O deadlock.
> 
> Once SCSI device is put into QUIESCE, no new request except for RQF_PREEMPT
> can be dispatched to SCSI successfully, and scsi_device_quiesce() just
> simply waits for completion of I/Os dispatched to SCSI stack. It isn't
> enough at all.
> 
> Because new request still can be allocated, but all the allocated
> requests can't be dispatched successfully, so request pool can be
> consumed up easily.
> 
> Then request with RQF_PREEMPT can't be allocated, and system may
> hang forever, such as during system suspend or SCSI domain alidation.
> 
> Both IO hang inside system suspend[1] or SCSI domain validation
> were reported before.
> 
> This patch introduces preempt freez, and tries to solve the issue
> by preempt freezing block queue during SCSI quiesce, and allows
> to allocate request of RQF_PREEMPT when queue is preempt-frozen.
> 
> Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes
> them all by introducing blk_freeze_queue_preempt() and
> blk_unfreeze_queue_preempt(), also unifying current interfaces for
> freezing queue between block legacy and blk-mq.
> 
> Oleksandr has verified that this patchset V2 fixes his I/O hang
> during suspend/resume cycle.
> 
> V3:
> 	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
> 	- call blk_queue_enter_live() only when queue is preempt frozen
> 	- cleanup a bit on the implementation of preempt freeze
> 	- only patch 6 and 7 are changed
> 
> V2:
> 	- drop the 1st patch in V1 because percpu_ref_is_dying() is
> 	enough as pointed by Tejun
> 	- introduce preempt version of blk_[freeze|unfreeze]_queue
> 	- sync between preempt freeze and normal freeze
> 	- fix warning from percpu-refcount as reported by Oleksandr
> 
> 
> [1] https://marc.info/?t=150340250100013&r=3&w=2
> 
> 
> 
> Ming Lei (8):
>   blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue
>   blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue
>   blk-mq: only run hw queues for blk-mq
>   blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
>   block: tracking request allocation with q_usage_counter
>   block: introduce preempt version of blk_[freeze|unfreeze]_queue
>   block: allow to allocate req with REQF_PREEMPT when queue is preempt
>     frozen
>   SCSI: preempt freeze block queue when SCSI device is put into quiesce
> 
>  block/bfq-iosched.c      |   2 +-
>  block/blk-cgroup.c       |   8 +--
>  block/blk-core.c         |  53 ++++++++++++---
>  block/blk-mq.c           | 170 +++++++++++++++++++++++++++++++++++++++--------
>  block/blk-mq.h           |   1 -
>  block/blk.h              |  17 +++++
>  block/elevator.c         |   4 +-
>  drivers/block/loop.c     |  16 ++---
>  drivers/block/rbd.c      |   2 +-
>  drivers/nvme/host/core.c |   8 +--
>  drivers/scsi/scsi_lib.c  |  22 +++++-
>  include/linux/blk-mq.h   |  15 +++--
>  include/linux/blkdev.h   |  21 +++++-
>  13 files changed, 273 insertions(+), 66 deletions(-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-02 13:12   ` Ming Lei
@ 2017-09-04  4:13     ` Bart Van Assche
  2017-09-04  7:16       ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-09-04  4:13 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, hch@infradead.org, Bart Van Assche,
	martin.petersen@oracle.com, ming.lei@redhat.com,
	linux-scsi@vger.kernel.org, axboe@fb.com, jejb@linux.vnet.ibm.com
  Cc: tj@kernel.org, jthumshirn@suse.de, oleksandr@natalenko.name

On Sat, 2017-09-02 at 21:12 +0800, Ming Lei wrote:
> Please let us know if V3 addresses your previous concern about calling
> blk_queue_enter_live() during preempt freezing.

Do you understand how request queue cleanup works? The algorithm used for
request queue cleanup is as follows:
* Set the DYING flag. This flag makes all later blk_get_request() calls
  fail.
* Wait until all pending requests have finished.
* Set the DEAD flag. For the traditional block layer, this flag causes
  blk_run_queue() not to call .request_fn() anymore. For blk-mq it is
  guaranteed in another way that .queue_rq() won't be called anymore after
  this flag has been set.
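Sketched as code, that ordering is roughly the following (a simplified
outline, not the exact blk_cleanup_queue() implementation; locking and
error handling are omitted):

/* hypothetical outline of request queue cleanup */
static void cleanup_queue_outline(struct request_queue *q)
{
	/* 1) make all later blk_get_request() calls fail */
	blk_set_queue_dying(q);

	/* 2) wait until all pending requests have finished */
	blk_freeze_queue(q);

	/* 3) .request_fn()/.queue_rq() are never invoked after this */
	queue_flag_set(QUEUE_FLAG_DEAD, q);
}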

Allowing blk_get_request() to succeed after the DYING flag has been set is
completely wrong because that could result in a request being queued after
the DEAD flag has been set, resulting in either a hanging request or a kernel
crash. This is why it's completely wrong to add a blk_queue_enter_live() call
in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
patch that adds a blk_queue_enter_live() call to any function called from
blk_get_request(). That includes the patch at the start of this e-mail thread.

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04  4:13     ` Bart Van Assche
@ 2017-09-04  7:16       ` Ming Lei
  2017-09-04 15:40         ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2017-09-04  7:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	martin.petersen@oracle.com, linux-scsi@vger.kernel.org,
	axboe@fb.com, jejb@linux.vnet.ibm.com, tj@kernel.org,
	jthumshirn@suse.de, oleksandr@natalenko.name

On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> On Sat, 2017-09-02 at 21:12 +0800, Ming Lei wrote:
> > Please let us know if V3 addresses your previous concern about calling
> > blk_queue_enter_live() during preempt freezing.
> 
> Do you understand how request queue cleanup works? The algorithm used for
> request queue cleanup is as follows:
> * Set the DYING flag. This flag makes all later blk_get_request() calls
>   fail.

If you look at __blk_freeze_queue_start(), you will see that a preempt
freeze fails once the queue is dying, and that blk_queue_enter_live() won't
be called at all if the queue is dying, right?
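Roughly, the check has the following shape (reconstructed from the
description above, not the patch verbatim):

	/* in __blk_freeze_queue_start(), before any freeze work: */
	if (preempt && blk_queue_dying(q))
		return false;	/* preempt freeze fails on a dying queue */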

> * Wait until all pending requests have finished.
> * Set the DEAD flag. For the traditional block layer, this flag causes
>   blk_run_queue() not to call .request_fn() anymore. For blk-mq it is
>   guaranteed in another way that .queue_rq() won't be called anymore after
>   this flag has been set.
> 
> Allowing blk_get_request() to succeed after the DYING flag has been set is
> completely wrong because that could result in a request being queued after

See above: this patch changes nothing about that fact. Please look at the
patch carefully next time before posting such a long comment.

> the DEAD flag has been set, resulting in either a hanging request or a kernel
> crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> patch that adds a blk_queue_enter_live() call to any function called from
> blk_get_request(). That includes the patch at the start of this e-mail thread.
> 
> Bart.

-- 
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue
  2017-09-02 13:08 ` [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
@ 2017-09-04 15:21   ` Bart Van Assche
  2017-09-04 16:20     ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-09-04 15:21 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, hch@infradead.org, Bart Van Assche,
	martin.petersen@oracle.com, ming.lei@redhat.com,
	linux-scsi@vger.kernel.org, axboe@fb.com, jejb@linux.vnet.ibm.com
  Cc: tj@kernel.org, jthumshirn@suse.de, oleksandr@natalenko.name

On Sat, 2017-09-02 at 21:08 +0800, Ming Lei wrote:
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -565,6 +565,10 @@ struct request_queue {
>  
>  	int			bypass_depth;
>  	atomic_t		mq_freeze_depth;
> +	spinlock_t		freeze_lock;
> +	unsigned		normal_freezing:1;
> +	unsigned		preempt_freezing:1;
> +	unsigned		preempt_unfreezing:1;
>  
>  #if defined(CONFIG_BLK_DEV_BSG)
>  	bsg_job_fn		*bsg_job_fn;

Request queues already have too many states, and you want to make request
queues even more complicated by introducing several new state variables? Yikes!

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04  7:16       ` Ming Lei
@ 2017-09-04 15:40         ` Bart Van Assche
  2017-09-04 16:08           ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-09-04 15:40 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > completely wrong because that could result in a request being queued after
> > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > patch that adds a blk_queue_enter_live() call to any function called from
> > blk_get_request(). That includes the patch at the start of this e-mail thread.
>
> > See above: this patch changes nothing about that fact. Please look at the
> > patch carefully next time before posting such a long comment.

Are you really sure that your patch does not allow blk_get_request() to
succeed after the DYING flag has been set? blk_mq_alloc_request() calls both
blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
any lock. A thread that is running concurrently with blk_mq_get_request()
can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
before blk_queue_enter_live() is called. This means that with your patch
series applied blk_get_request() can succeed after the DYING flag has been
set, which is something we don't want. Additionally, I don't think we want
to introduce any kind of locking in blk_mq_get_request() because that would
be a serialization point.
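To make the race concrete, consider this hypothetical interleaving
between an allocating thread (T0) and a thread that unfreezes the queue
and then starts cleanup (T1):

	/*
	 * T0: blk_queue_is_preempt_frozen(q)  -> returns true
	 * T1: blk_unfreeze_queue_preempt(q)
	 * T1: blk_set_queue_dying(q)          -> DYING is now set
	 * T0: blk_queue_enter_live(q)         -> still succeeds
	 *
	 * T0 can now allocate a request on a dying queue.
	 */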

Have you considered to use the blk-mq "reserved request" mechanism to avoid
starvation of power management requests instead of making the block layer
even more complicated than it already is?

Note: extending blk_mq_freeze/unfreeze_queue() to the legacy block layer
could be useful to make scsi_wait_for_queuecommand() more elegant. However,
I don't think we should spend our time on legacy block layer / SCSI core
changes. The code I'm referring to is the following:

/**
 * scsi_wait_for_queuecommand() - wait for ongoing queuecommand() calls
 * @sdev: SCSI device pointer.
 *
 * Wait until the ongoing shost->hostt->queuecommand() calls that are
 * invoked from scsi_request_fn() have finished.
 */
static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
{
	WARN_ON_ONCE(sdev->host->use_blk_mq);

	while (scsi_request_fn_active(sdev))
		msleep(20);
}
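With freeze support extended to the legacy path, that polling loop could
in principle be replaced by something like the following (just a sketch
using the naming introduced by this series):

static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
{
	/* waiting for all pending requests to finish also acts as a
	 * barrier against in-progress ->queuecommand() calls */
	blk_freeze_queue(sdev->request_queue);
	blk_unfreeze_queue(sdev->request_queue);
}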

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04 15:40         ` Bart Van Assche
@ 2017-09-04 16:08           ` Ming Lei
  2017-09-04 16:18             ` Bart Van Assche
  2017-09-05  1:40             ` Bart Van Assche
  0 siblings, 2 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-04 16:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> > On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > > completely wrong because that could result in a request being queued after
> > > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > > patch that adds a blk_queue_enter_live() call to any function called from
> > > blk_get_request(). That includes the patch at the start of this e-mail thread.
> >
> > See above: this patch changes nothing about that fact. Please look at the
> > patch carefully next time before posting such a long comment.
> 
> Are you really sure that your patch does not allow blk_get_request() to
> succeed after the DYING flag has been set? blk_mq_alloc_request() calls both

Yeah, I am pretty sure.

Firstly, blk_queue_freeze_preempt() is exclusive: it waits for completion
of all pending freezes (both normal and preempt), and no other freeze can
be started while a preempt freeze is in progress. It is really a typical
read/write lock use case, but we need to support nested normal freezing,
so we can't use an rwsem.

Also, the DYING flag is checked before a preempt freeze is started: the
API returns without setting the preempt_freezing flag if DYING is set.

Secondly, after a preempt freeze is started:

	- for the block legacy path, dying is always tested at the entry of
	__get_request(), so no new request is allocated once the queue is
	dying.

	- for blk-mq, it is normal for the DYING flag to be set between
	blk_queue_enter() and allocating the request, because we depend on
	the LLD to handle that case. We could tighten this by also checking
	the dying flag in blk_queue_enter(), but that is just an
	improvement; it doesn't mean V3 is incorrect.
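So the allocation side ends up roughly as follows (a sketch reconstructed
from this description, not the exact V3 code):

	/* in blk_mq_alloc_request() / blk_old_get_request() */
	if (blk_queue_is_preempt_frozen(q))
		blk_queue_enter_live(q);	/* RQF_PREEMPT only */
	else if (blk_queue_enter(q, nowait))
		return ERR_PTR(-ENODEV);	/* fails once DYING is set */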

> blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
> any lock. A thread that is running concurrently with blk_mq_get_request()
> can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
> before blk_queue_enter_live() is called. This means that with your patch

Preempt freezing is exclusive, so no other freeze can be started at all;
the issue you are worried about cannot happen.

> series applied blk_get_request() can succeed after the DYING flag has been
> set, which is something we don't want. Additionally, I don't think we want
> to introduce any kind of locking in blk_mq_get_request() because that would
> be a serialization point.

That needn't be worried about: as you saw, we can check
percpu_ref_is_dying() first, and only then acquire the lock to check the
flags when the queue is freezing.

> 
> Have you considered to use the blk-mq "reserved request" mechanism to avoid
> starvation of power management requests instead of making the block layer
> even more complicated than it already is?

A reserved request is really a bad idea: it means the reserved request
can't be used for normal I/O. We all know the request/tag space is
precious, and some devices have quite a small tag space, such as SATA.
This way will definitely hurt performance.

Also I don't think the approach is complicated; the idea is simple, and
the implementation isn't complicated either.

> 
> Note: extending blk_mq_freeze/unfreeze_queue() to the legacy block layer
> could be useful to make scsi_wait_for_queuecommand() more elegant. However,
> I don't think we should spend our time on legacy block layer / SCSI core
> changes. The code I'm referring to is the following:

That could be a side product of this approach, but this patchset is
focused on fixing the I/O hang.

-- 
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04 16:08           ` Ming Lei
@ 2017-09-04 16:18             ` Bart Van Assche
  2017-09-04 16:28               ` Ming Lei
  2017-09-05  1:40             ` Bart Van Assche
  1 sibling, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-09-04 16:18 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> > > On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > > > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > > > completely wrong because that could result in a request being queued after
> > > > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > > > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > > > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > > > patch that adds a blk_queue_enter_live() call to any function called from
> > > > blk_get_request(). That includes the patch at the start of this e-mail thread.
> > > 
> > > See above: this patch changes nothing about that fact. Please look at the
> > > patch carefully next time before posting such a long comment.
> > 
> > Are you really sure that your patch does not allow blk_get_request() to
> > succeed after the DYING flag has been set? blk_mq_alloc_request() calls both
> > blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
> > any lock. A thread that is running concurrently with blk_mq_get_request()
> > can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
> > before blk_queue_enter_live() is called. This means that with your patch
> > series applied blk_get_request() can succeed after the DYING flag has been
> > set, which is something we don't want. Additionally, I don't think we want
> > to introduce any kind of locking in blk_mq_get_request() because that would
> > be a serialization point.
>
> Yeah, I am pretty sure.
> 
> Firstly, blk_queue_freeze_preempt() is exclusive: it waits for completion
> of all pending freezes (both normal and preempt), and no other freeze can
> be started while a preempt freeze is in progress. It is really a typical
> read/write lock use case, but we need to support nested normal freezing,
> so we can't use an rwsem.

You seem to overlook that blk_get_request() can be called from a thread
other than the one performing the freezing and unfreezing.

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue
  2017-09-04 15:21   ` Bart Van Assche
@ 2017-09-04 16:20     ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-04 16:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	martin.petersen@oracle.com, linux-scsi@vger.kernel.org,
	axboe@fb.com, jejb@linux.vnet.ibm.com, tj@kernel.org,
	jthumshirn@suse.de, oleksandr@natalenko.name

On Mon, Sep 04, 2017 at 03:21:08PM +0000, Bart Van Assche wrote:
> On Sat, 2017-09-02 at 21:08 +0800, Ming Lei wrote:
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -565,6 +565,10 @@ struct request_queue {
> >  
> >  	int			bypass_depth;
> >  	atomic_t		mq_freeze_depth;
> > +	spinlock_t		freeze_lock;
> > +	unsigned		normal_freezing:1;
> > +	unsigned		preempt_freezing:1;
> > +	unsigned		preempt_unfreezing:1;
> >  
> >  #if defined(CONFIG_BLK_DEV_BSG)
> >  	bsg_job_fn		*bsg_job_fn;
> 
> Request queues already have too many states, and you want to make request
> queues even more complicated by introducing several new state variables? Yikes!

The three flags are used in the freeze/unfreeze path only, and I don't
think they are too complicated to maintain. Actually each state is simple
enough:

	- normal_freezing means the queue is in a normal freeze; it is set
	before blk_queue_freeze() returns. In this state, no request at all
	can be allocated from the queue, just like the current blk queue
	freezing.

	- preempt_freezing means the queue is in a preempt freeze; the flag
	is set before blk_queue_freeze_preempt() returns successfully. In
	this state, only RQF_PREEMPT requests are allowed to be allocated.

	- preempt_unfreezing means the queue is in preempt unfreezing; it is
	set at the entry of blk_queue_unfreeze_preempt(). In this state, no
	request at all can be allocated from the queue.
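Put together, a helper deciding whether an allocation may proceed would
look something like this (a hypothetical illustration of the three
states, not code from the patchset):

static bool may_allocate(struct request_queue *q, bool preempt)
{
	if (q->normal_freezing)
		return false;		/* nothing may be allocated */
	if (q->preempt_unfreezing)
		return false;		/* nothing may be allocated */
	if (q->preempt_freezing)
		return preempt;		/* RQF_PREEMPT requests only */
	return true;			/* queue not frozen */
}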

-- 
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04 16:18             ` Bart Van Assche
@ 2017-09-04 16:28               ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-04 16:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Mon, Sep 04, 2017 at 04:18:32PM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> > > > On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > > > > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > > > > completely wrong because that could result in a request being queued after
> > > > > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > > > > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > > > > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > > > > patch that adds a blk_queue_enter_live() call to any function called from
> > > > > blk_get_request(). That includes the patch at the start of this e-mail thread.
> > > > 
> > > > See above: this patch changes nothing about that fact. Please look at the
> > > > patch carefully next time before posting such a long comment.
> > > 
> > > Are you really sure that your patch does not allow blk_get_request() to
> > > succeed after the DYING flag has been set? blk_mq_alloc_request() calls both
> > > blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
> > > any lock. A thread that is running concurrently with blk_mq_get_request()
> > > can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
> > > before blk_queue_enter_live() is called. This means that with your patch
> > > series applied blk_get_request() can succeed after the DYING flag has been
> > > set, which is something we don't want. Additionally, I don't think we want
> > > to introduce any kind of locking in blk_mq_get_request() because that would
> > > be a serialization point.
> >
> > Yeah, I am pretty sure.
> > 
> > Firstly, blk_queue_freeze_preempt() is exclusive: it waits for completion
> > of all pending freezes (both normal and preempt), and no other freeze can
> > be started while a preempt freeze is in progress. It is really a typical
> > read/write lock use case, but we need to support nested normal freezing,
> > so we can't use an rwsem.
> 
> You seem to overlook that blk_get_request() can be called from a thread
> other than the one performing the freezing and unfreezing.

The freezing state of the queue can be observed from any context, so this
issue isn't related to context; please see blk_queue_is_preempt_frozen(),
which is called from blk_get_request():

static bool blk_queue_is_preempt_frozen(struct request_queue *q)
{
        bool preempt_frozen, preempt_unfreezing;

        /* not frozen at all if the usage counter is still live */
        if (!percpu_ref_is_dying(&q->q_usage_counter))
                return false;

        /* sample both preempt-freeze flags under freeze_lock */
        spin_lock(&q->freeze_lock);
        preempt_frozen = q->preempt_freezing;
        preempt_unfreezing = q->preempt_unfreezing;
        spin_unlock(&q->freeze_lock);

        return preempt_frozen && !preempt_unfreezing;
}


If the queue is preempt-frozen, blk_queue_enter_live() is called before
allocating the request; otherwise blk_queue_enter() is called.

BTW, we could add a check for a dying queue in this helper as a further
improvement.

-- 
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04 16:08           ` Ming Lei
  2017-09-04 16:18             ` Bart Van Assche
@ 2017-09-05  1:40             ` Bart Van Assche
  2017-09-05  2:23               ` Ming Lei
  1 sibling, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-09-05  1:40 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > Have you considered to use the blk-mq "reserved request" mechanism to avoid
> > starvation of power management requests instead of making the block layer
> > even more complicated than it already is?
> 
> reserved request is really a bad idea, that means the reserved request
> can't be used for normal I/O, we all know the request/tag space is
> precious, and some device has a quite small tag space, such as sata.
> This way will affect performance definitely.

Sorry but I'm neither convinced that reserving a request for power management 
would be a bad idea nor that it would have a significant performance impact nor
that it would be complicated to implement. Have you noticed that the Linux ATA
implementation already reserves a request for internal use and thereby reduces
the queue depth from 32 to 31 (see also ATA_TAG_INTERNAL)? What I would like to
know is whether the performance impact of reserving a request is more or less
than 1%.
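For reference, the blk-mq reserved-tag mechanism looks roughly like this
(an illustrative fragment only; one tag is carved out of the queue depth
when the tag set is created and later requested with a flag):

	struct blk_mq_tag_set set = {
		.queue_depth	= 32,
		.reserved_tags	= 1,	/* held back, e.g. for PM requests */
		/* other fields omitted */
	};

	/* a PM request would then be allocated from the reserved pool: */
	rq = blk_mq_alloc_request(q, REQ_OP_DRV_OUT, BLK_MQ_REQ_RESERVED);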

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-05  1:40             ` Bart Van Assche
@ 2017-09-05  2:23               ` Ming Lei
  2017-09-08  3:08                 ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2017-09-05  2:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Tue, Sep 05, 2017 at 01:40:11AM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > Have you considered to use the blk-mq "reserved request" mechanism to avoid
> > > starvation of power management requests instead of making the block layer
> > > even more complicated than it already is?
> > 
> > A reserved request is really a bad idea: it means the reserved request
> > can't be used for normal I/O. We all know the request/tag space is
> > precious, and some devices have quite a small tag space, such as SATA.
> > This way will definitely hurt performance.
> 
> Sorry but I'm neither convinced that reserving a request for power management 
> would be a bad idea nor that it would have a significant performance impact nor
> that it would be complicated to implement. Have you noticed that the Linux ATA
> implementation already reserves a request for internal use and thereby reduces
> the queue depth from 32 to 31 (see also ATA_TAG_INTERNAL)? What I would like to
> know is whether the performance impact of reserving a request is more or less
> than 1%.

Firstly, we really can avoid the reservation, so why should we waste one
precious tag just for PM, which may never happen at all while a machine is
running? For SATA, the internal tag is for EH; there I believe the
reservation is inevitable.

Secondly, reserving one tag decreases the concurrent I/O by 1, which
definitely hurts performance, especially for random I/O. Think about why
NVMe makes its queue depth so large. Not to mention there are devices
which have only one tag (.can_queue is 1); how can you reserve a tag on
that kind of device?

Finally, bad results will follow if you reserve one tag for PM only.
Supposing it is doable to reserve one tag, did you consider how to do
that? Tags can be shared host-wide, so do you want to reserve one tag
just for one request_queue?
	- If yes, lots of tags can be reserved/wasted for the unusual PM
	or similar commands; even worse, the whole tag space of the HBA may
	not be enough for the reservation if there are lots of LUNs in
	this HBA.

	- If not, and you just reserve one tag for one HBA, then all
	PM commands share that one reservation. During suspend/resume, all
	these PM commands have to run exclusively (serialized) for the disks
	attached to the HBA, which will slow down suspend/resume very
	much because there may be lots of LUNs in this HBA.

That is why I said reserving one tag is really bad.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-05  2:23               ` Ming Lei
@ 2017-09-08  3:08                 ` Ming Lei
  2017-09-08 17:28                   ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2017-09-08  3:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Tue, Sep 05, 2017 at 10:23:25AM +0800, Ming Lei wrote:
> On Tue, Sep 05, 2017 at 01:40:11AM +0000, Bart Van Assche wrote:
> > On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > > Have you considered to use the blk-mq "reserved request" mechanism to avoid
> > > > starvation of power management requests instead of making the block layer
> > > > even more complicated than it already is?
> > > 
> > > A reserved request is really a bad idea: it means the reserved request
> > > can't be used for normal I/O. We all know the request/tag space is
> > > precious, and some devices have quite a small tag space, such as SATA.
> > > This way will definitely hurt performance.
> > 
> > Sorry but I'm neither convinced that reserving a request for power management 
> > would be a bad idea nor that it would have a significant performance impact nor
> > that it would be complicated to implement. Have you noticed that the Linux ATA
> > implementation already reserves a request for internal use and thereby reduces
> > the queue depth from 32 to 31 (see also ATA_TAG_INTERNAL)? What I would like to
> > know is whether the performance impact of reserving a request is more or less
> > than 1%.
> 
> Firstly, we really can avoid the reservation, so why should we waste one
> precious tag just for PM, which may never happen at all while a machine is
> running? For SATA, the internal tag is for EH; there I believe the
> reservation is inevitable.
> 
> Secondly, reserving one tag decreases the concurrent I/O by 1, which
> definitely hurts performance, especially for random I/O. Think about why
> NVMe makes its queue depth so large. Not to mention there are devices
> which have only one tag (.can_queue is 1); how can you reserve a tag on
> that kind of device?
> 
> Finally, bad results will follow if you reserve one tag for PM only.
> Supposing it is doable to reserve one tag, did you consider how to do
> that? Tags can be shared host-wide, so do you want to reserve one tag
> just for one request_queue?
> 	- If yes, lots of tags can be reserved/wasted for the unusual PM
> 	or similar commands; even worse, the whole tag space of the HBA may
> 	not be enough for the reservation if there are lots of LUNs in
> 	this HBA.
> 
> 	- If not, and you just reserve one tag for one HBA, then all
> 	PM commands share that one reservation. During suspend/resume, all
> 	these PM commands have to run exclusively (serialized) for the disks
> 	attached to the HBA, which will slow down suspend/resume very
> 	much because there may be lots of LUNs in this HBA.
> 
> That is why I said reserving one tag is really bad.

Hi Bart,

Looks like I have replied to or clarified all your questions/comments on
this patchset.

So could you let me know if you still object to this approach or to the
patchset?

If not, I think we need to move on and fix the real issue.

-- 
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-08  3:08                 ` Ming Lei
@ 2017-09-08 17:28                   ` Bart Van Assche
  2017-09-09  7:21                     ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-09-08 17:28 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Fri, 2017-09-08 at 11:08 +0800, Ming Lei wrote:
> Looks like I have replied to or clarified all your questions/comments on
> this patchset.

No, you have not addressed all my comments; you know that you have not
addressed all my comments, so you should not have written that you have
addressed all my comments. This patch series not only introduces ugly
changes in the request queue freeze mechanism but it also introduces an
unacceptable race condition between blk_get_request() and request queue
cleanup.

BTW, you don't have to spend more time on this patch series. I have
implemented an alternative and much cleaner approach to fix SCSI device
suspend. I'm currently testing that patch series.

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-08 17:28                   ` Bart Van Assche
@ 2017-09-09  7:21                     ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-09-09  7:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	jthumshirn@suse.de, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, axboe@fb.com,
	oleksandr@natalenko.name, jejb@linux.vnet.ibm.com, tj@kernel.org

On Fri, Sep 08, 2017 at 05:28:16PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-08 at 11:08 +0800, Ming Lei wrote:
> > Looks like I have replied to or clarified all your questions/comments on
> > this patchset.
> 
> No, you have not addressed all my comments; you know that you have not
> addressed all my comments, so you should not have written that you have

I do not know.

> addressed all my comments. This patch series not only introduces ugly
> changes in the request queue freeze mechanism but it also introduces an
> unacceptable race condition between blk_get_request() and request queue
> cleanup.

So what is the race? Could you reply in the original thread?

> 
> BTW, you don't have to spend more time on this patch series. I have
> implemented an alternative and much cleaner approach to fix SCSI device
> suspend. I'm currently testing that patch series.

You do not understand the issue at all; it is not only related to
suspend. Please take a look at scsi_execute(): each request run
via scsi_execute() needs to be dispatched successfully even when the
SCSI device is quiesced.

-- 
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-09-09  7:21 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
2017-09-02 13:08 ` [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue Ming Lei
2017-09-02 13:08 ` [PATCH V3 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue Ming Lei
2017-09-02 13:08 ` [PATCH V3 3/8] blk-mq: only run hw queues for blk-mq Ming Lei
2017-09-02 13:08 ` [PATCH V3 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
2017-09-02 13:08 ` [PATCH V3 5/8] block: tracking request allocation with q_usage_counter Ming Lei
2017-09-02 13:08 ` [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
2017-09-04 15:21   ` Bart Van Assche
2017-09-04 16:20     ` Ming Lei
2017-09-02 13:08 ` [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen Ming Lei
2017-09-02 13:12   ` Ming Lei
2017-09-04  4:13     ` Bart Van Assche
2017-09-04  7:16       ` Ming Lei
2017-09-04 15:40         ` Bart Van Assche
2017-09-04 16:08           ` Ming Lei
2017-09-04 16:18             ` Bart Van Assche
2017-09-04 16:28               ` Ming Lei
2017-09-05  1:40             ` Bart Van Assche
2017-09-05  2:23               ` Ming Lei
2017-09-08  3:08                 ` Ming Lei
2017-09-08 17:28                   ` Bart Van Assche
2017-09-09  7:21                     ` Ming Lei
2017-09-02 13:08 ` [PATCH V3 8/8] SCSI: preempt freeze block queue when SCSI device is put into quiesce Ming Lei
2017-09-02 14:47 ` [PATCH V3 0/8] block/scsi: safe SCSI quiescing Oleksandr Natalenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).