* [PATCH v6 0/9] block, scsi, md: Improve suspend and resume
@ 2017-10-05 0:01 Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 1/9] md: Rename md_notifier into md_reboot_notifier Bart Van Assche
` (8 more replies)
0 siblings, 9 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche
Hello Jens,
It is known that during the resume following a hibernate, especially when
using an md RAID1 array created on top of SCSI devices, the system sometimes
hangs instead of coming up properly. This patch series fixes that problem.
It is an alternative to Ming Lei's "block/scsi: safe SCSI quiescing" patch
series; its advantage is that it also includes a fix for the md driver.
These patches have been tested on top of the block layer for-next branch.
Please consider these changes for kernel v4.15.
Thanks,
Bart.
Changes between v5 and v6:
- Split an md patch into two patches to make it easier to review the changes.
- For the md patch that suspends I/O while the system is frozen, switched back
to the freezer mechanism because this makes the code shorter and easier to
review.
- Changed blk_set/clear_preempt_only() from EXPORT_SYMBOL() into
EXPORT_SYMBOL_GPL().
- Made blk_set_preempt_only() behave as a test-and-set operation.
- Introduced blk_get_request_flags() and BLK_MQ_REQ_PREEMPT as requested by
Christoph and reduced the number of arguments of blk_queue_enter() back from
three to two.
- In scsi_device_quiesce(), moved the blk_mq_freeze_queue() call out of a
critical section. Made the explanation of why the synchronize_rcu() call
is necessary more detailed.
Changes between v4 and v5:
- Split blk_set_preempt_only() into two functions as requested by Christoph.
- Made blk_get_request() trigger WARN_ONCE() if an attempt is made to allocate
a request while the system is frozen. This is a big help for identifying drivers
that submit I/O while the system is frozen.
- Since kernel thread freezing is on its way out, reworked the approach for
preventing the md driver from submitting I/O while the system is frozen such
that it no longer depends on the kernel thread freeze mechanism.
- Fixed the (theoretical) deadlock in scsi_device_quiesce() that was identified
by Ming.
Changes between v3 and v4:
- Made sure that this patch series not only works for scsi-mq but also for
the legacy SCSI stack.
- Removed an smp_rmb()/smp_wmb() pair from the hot path and added a comment
that explains why that is safe.
- Reordered the patches in this patch series.
Changes between v2 and v3:
- Made md kernel threads freezable.
- Changed the approach for quiescing SCSI devices again.
- Addressed Ming's review comments.
Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.
Bart Van Assche (8):
md: Rename md_notifier into md_reboot_notifier
md: Introduce md_stop_all_writes()
md: Neither resync nor reshape while the system is frozen
block: Introduce blk_get_request_flags()
block: Introduce BLK_MQ_REQ_PREEMPT
ide, scsi: Tell the block layer at request allocation time about
preempt requests
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
block, scsi: Make SCSI device suspend and resume work reliably
Ming Lei (1):
block: Make q_usage_counter also track legacy requests
block/blk-core.c | 127 ++++++++++++++++++++++++++++++++++++++++--------
block/blk-mq.c | 16 +++---
block/blk-timeout.c | 2 +-
drivers/ide/ide-pm.c | 4 +-
drivers/md/md.c | 45 ++++++++++++++---
drivers/scsi/scsi_lib.c | 33 ++++++++-----
fs/block_dev.c | 4 +-
include/linux/blk-mq.h | 1 +
include/linux/blkdev.h | 11 ++++-
9 files changed, 189 insertions(+), 54 deletions(-)
--
2.14.2
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v6 1/9] md: Rename md_notifier into md_reboot_notifier
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 2/9] md: Introduce md_stop_all_writes() Bart Van Assche
` (7 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Shaohua Li, linux-raid, Hannes Reinecke,
Johannes Thumshirn
This avoids confusion with the pm notifier that will be added
through a later patch.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
drivers/md/md.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5d61049e7417..8933cafc212d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8966,7 +8966,7 @@ static int md_notify_reboot(struct notifier_block *this,
return NOTIFY_DONE;
}
-static struct notifier_block md_notifier = {
+static struct notifier_block md_reboot_notifier = {
.notifier_call = md_notify_reboot,
.next = NULL,
.priority = INT_MAX, /* before any real devices */
@@ -9003,7 +9003,7 @@ static int __init md_init(void)
blk_register_region(MKDEV(mdp_major, 0), 1UL<<MINORBITS, THIS_MODULE,
md_probe, NULL, NULL);
- register_reboot_notifier(&md_notifier);
+ register_reboot_notifier(&md_reboot_notifier);
raid_table_header = register_sysctl_table(raid_root_table);
md_geninit();
@@ -9243,7 +9243,7 @@ static __exit void md_exit(void)
unregister_blkdev(MD_MAJOR,"md");
unregister_blkdev(mdp_major, "mdp");
- unregister_reboot_notifier(&md_notifier);
+ unregister_reboot_notifier(&md_reboot_notifier);
unregister_sysctl_table(raid_table_header);
/* We cannot unload the modules while some process is
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 2/9] md: Introduce md_stop_all_writes()
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 1/9] md: Rename md_notifier into md_reboot_notifier Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 3/9] md: Neither resync nor reshape while the system is frozen Bart Van Assche
` (6 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Shaohua Li, linux-raid, Hannes Reinecke,
Johannes Thumshirn
Introduce md_stop_all_writes() because the next patch will add
a second caller for this function. This patch does not change
any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
drivers/md/md.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8933cafc212d..b99584e5d6b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8937,8 +8937,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
}
EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
-static int md_notify_reboot(struct notifier_block *this,
- unsigned long code, void *x)
+static void md_stop_all_writes(void)
{
struct list_head *tmp;
struct mddev *mddev;
@@ -8962,6 +8961,12 @@ static int md_notify_reboot(struct notifier_block *this,
*/
if (need_delay)
mdelay(1000*1);
+}
+
+static int md_notify_reboot(struct notifier_block *this,
+ unsigned long code, void *x)
+{
+ md_stop_all_writes();
return NOTIFY_DONE;
}
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 3/9] md: Neither resync nor reshape while the system is frozen
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 1/9] md: Rename md_notifier into md_reboot_notifier Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 2/9] md: Introduce md_stop_all_writes() Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 4/9] block: Make q_usage_counter also track legacy requests Bart Van Assche
` (5 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Shaohua Li, linux-raid, Hannes Reinecke,
Johannes Thumshirn
Some people use the md driver on laptops and use the suspend and
resume functionality. Since it is essential that submitting of
new I/O requests stops before a hibernation image is created,
interrupt the md resync and reshape actions if the system is
being frozen. Note: the resync and reshape will restart after
the system is resumed and a message similar to the following
will appear in the system log:
md: md0: data-check interrupted.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
drivers/md/md.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b99584e5d6b1..d712d3320c1d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,8 @@
#include <linux/raid/md_u.h>
#include <linux/slab.h>
#include <linux/percpu-refcount.h>
+#include <linux/freezer.h>
+#include <linux/suspend.h>
#include <trace/events/block.h>
#include "md.h"
@@ -7439,6 +7441,7 @@ static int md_thread(void *arg)
*/
allow_signal(SIGKILL);
+ set_freezable();
while (!kthread_should_stop()) {
/* We need to wait INTERRUPTIBLE so that
@@ -7449,7 +7452,7 @@ static int md_thread(void *arg)
if (signal_pending(current))
flush_signals(current);
- wait_event_interruptible_timeout
+ wait_event_freezable_timeout
(thread->wqueue,
test_bit(THREAD_WAKEUP, &thread->flags)
|| kthread_should_stop() || kthread_should_park(),
@@ -8963,6 +8966,29 @@ static void md_stop_all_writes(void)
mdelay(1000*1);
}
+/*
+ * Ensure that neither resyncing nor reshaping occurs while the system is
+ * frozen.
+ */
+static int md_notify_pm(struct notifier_block *bl, unsigned long state,
+ void *unused)
+{
+ pr_debug("%s: state = %ld\n", __func__, state);
+
+ switch (state) {
+ case PM_HIBERNATION_PREPARE:
+ case PM_SUSPEND_PREPARE:
+ case PM_RESTORE_PREPARE:
+ md_stop_all_writes();
+ break;
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block md_pm_notifier = {
+ .notifier_call = md_notify_pm,
+};
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{
@@ -9009,6 +9035,7 @@ static int __init md_init(void)
md_probe, NULL, NULL);
register_reboot_notifier(&md_reboot_notifier);
+ register_pm_notifier(&md_pm_notifier);
raid_table_header = register_sysctl_table(raid_root_table);
md_geninit();
@@ -9248,6 +9275,7 @@ static __exit void md_exit(void)
unregister_blkdev(MD_MAJOR,"md");
unregister_blkdev(mdp_major, "mdp");
+ unregister_pm_notifier(&md_pm_notifier);
unregister_reboot_notifier(&md_reboot_notifier);
unregister_sysctl_table(raid_table_header);
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 4/9] block: Make q_usage_counter also track legacy requests
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
` (2 preceding siblings ...)
2017-10-05 0:01 ` [PATCH v6 3/9] md: Neither resync nor reshape while the system is frozen Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 5/9] block: Introduce blk_get_request_flags() Bart Van Assche
` (4 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Hannes Reinecke, Johannes Thumshirn
From: Ming Lei <ming.lei@redhat.com>
This patch makes it possible to pause request allocation for
the legacy block layer by calling blk_mq_freeze_queue() and
blk_mq_unfreeze_queue().
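
For readers who are not familiar with the freeze API, here is a minimal
sketch (not part of the patch; the helper name is made up for illustration)
of the usage pattern this change enables for legacy request queues:

/*
 * Sketch only: once q_usage_counter also tracks legacy requests, the
 * blk-mq freeze API pauses request allocation for both I/O paths.
 */
static void update_queue_state(struct request_queue *q)
{
	blk_mq_freeze_queue(q);		/* wait until all pending requests have finished */
	/* ... change queue state while no new requests can be allocated ... */
	blk_mq_unfreeze_queue(q);	/* allow request allocation again */
}
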
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[ bvanassche: Combined two patches into one, edited a comment and made sure
REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
block/blk-core.c | 12 ++++++++++++
block/blk-mq.c | 10 ++--------
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 14f7674fa0b1..265a69511c47 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,9 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+ /* Make blk_queue_enter() reexamine the DYING flag. */
+ wake_up_all(&q->mq_freeze_wq);
}
EXPORT_SYMBOL_GPL(blk_set_queue_dying);
@@ -1395,16 +1398,22 @@ static struct request *blk_old_get_request(struct request_queue *q,
unsigned int op, gfp_t gfp_mask)
{
struct request *rq;
+ int ret = 0;
WARN_ON_ONCE(q->mq_ops);
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
+ ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
+ (op & REQ_NOWAIT));
+ if (ret)
+ return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+ blk_queue_exit(q);
return rq;
}
@@ -1576,6 +1585,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+ blk_queue_exit(q);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1857,8 +1867,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
* Grab a free request. This is might sleep but can not fail.
* Returns with the queue unlocked.
*/
+ blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+ blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f01d69879d6..ebf85e270509 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
- blk_mq_run_hw_queues(q, false);
+ if (q->mq_ops)
+ blk_mq_run_hw_queues(q, false);
}
}
EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
@@ -255,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
- /*
- * If we are called because the queue has now been marked as
- * dying, we need to ensure that processes currently waiting on
- * the queue are notified as well.
- */
- wake_up_all(&q->mq_freeze_wq);
}
bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 5/9] block: Introduce blk_get_request_flags()
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
` (3 preceding siblings ...)
2017-10-05 0:01 ` [PATCH v6 4/9] block: Make q_usage_counter also track legacy requests Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 6/9] block: Introduce BLK_MQ_REQ_PREEMPT Bart Van Assche
` (3 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Hannes Reinecke, Johannes Thumshirn
Introduce blk_get_request_flags(), a variant of blk_get_request() that accepts
BLK_MQ_REQ_* flags instead of a GFP mask. A side effect of this patch is that
the GFP mask that is passed to several allocation functions in the legacy block
layer is changed from GFP_KERNEL into __GFP_DIRECT_RECLAIM.
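
As an illustration of the interface change (a sketch based on the hunks
below, not additional code from this patch; q is assumed to point to a live
request queue):

struct request *rq;

/* Old interface: blocking behavior is selected through the GFP mask. */
rq = blk_get_request(q, REQ_OP_DRV_IN, GFP_KERNEL);

/* New interface: the same intent is expressed with BLK_MQ_REQ_* flags. */
rq = blk_get_request_flags(q, REQ_OP_DRV_IN, 0);		  /* may sleep */
rq = blk_get_request_flags(q, REQ_OP_DRV_IN, BLK_MQ_REQ_NOWAIT); /* fails instead of sleeping */
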
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
block/blk-core.c | 50 +++++++++++++++++++++++++++++++++++---------------
include/linux/blkdev.h | 3 +++
2 files changed, 38 insertions(+), 15 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 265a69511c47..03156d9ede94 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1157,7 +1157,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
* @rl: request list to allocate from
* @op: operation and flags
* @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
* @flags: BLK_MQ_REQ_* flags
*
* Get a free request from @q. This function may fail under memory
* pressure or if @q is dead.
@@ -1167,7 +1167,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
* Returns request pointer on success, with @q->queue_lock *not held*.
*/
static struct request *__get_request(struct request_list *rl, unsigned int op,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, unsigned int flags)
{
struct request_queue *q = rl->q;
struct request *rq;
@@ -1176,6 +1176,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
struct io_cq *icq = NULL;
const bool is_sync = op_is_sync(op);
int may_queue;
+ gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+ __GFP_DIRECT_RECLAIM;
req_flags_t rq_flags = RQF_ALLOCED;
lockdep_assert_held(q->queue_lock);
@@ -1336,7 +1338,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
* @q: request_queue to allocate request from
* @op: operation and flags
* @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags.
*
* Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
* this function keeps retrying under memory pressure and fails iff @q is dead.
@@ -1346,7 +1348,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
* Returns request pointer on success, with @q->queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, unsigned int op,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, unsigned int flags)
{
const bool is_sync = op_is_sync(op);
DEFINE_WAIT(wait);
@@ -1358,7 +1360,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
rl = blk_get_rl(q, bio); /* transferred to @rq on success */
retry:
- rq = __get_request(rl, op, bio, gfp_mask);
+ rq = __get_request(rl, op, bio, flags);
if (!IS_ERR(rq))
return rq;
@@ -1367,7 +1369,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
return ERR_PTR(-EAGAIN);
}
- if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
+ if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
}
@@ -1394,10 +1396,13 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
goto retry;
}
+/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
static struct request *blk_old_get_request(struct request_queue *q,
- unsigned int op, gfp_t gfp_mask)
+ unsigned int op, unsigned int flags)
{
struct request *rq;
+ gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+ __GFP_DIRECT_RECLAIM;
int ret = 0;
WARN_ON_ONCE(q->mq_ops);
@@ -1410,7 +1415,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
- rq = get_request(q, op, NULL, gfp_mask);
+ rq = get_request(q, op, NULL, flags);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
blk_queue_exit(q);
@@ -1424,25 +1429,40 @@ static struct request *blk_old_get_request(struct request_queue *q,
return rq;
}
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
- gfp_t gfp_mask)
+/**
+ * blk_get_request_flags - allocate a request
+ * @q: request queue to allocate a request for
+ * @op: operation (REQ_OP_*) and REQ_* flags, e.g. REQ_SYNC.
+ * @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
+ */
+struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
+ unsigned int flags)
{
struct request *req;
+ WARN_ON_ONCE(op & REQ_NOWAIT);
+ WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT);
+
if (q->mq_ops) {
- req = blk_mq_alloc_request(q, op,
- (gfp_mask & __GFP_DIRECT_RECLAIM) ?
- 0 : BLK_MQ_REQ_NOWAIT);
+ req = blk_mq_alloc_request(q, op, flags);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
- req = blk_old_get_request(q, op, gfp_mask);
+ req = blk_old_get_request(q, op, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
return req;
}
+EXPORT_SYMBOL(blk_get_request_flags);
+
+struct request *blk_get_request(struct request_queue *q, unsigned int op,
+ gfp_t gfp_mask)
+{
+ return blk_get_request_flags(q, op, gfp_mask & __GFP_DIRECT_RECLAIM ?
+ 0 : BLK_MQ_REQ_NOWAIT);
+}
EXPORT_SYMBOL(blk_get_request);
/**
@@ -1868,7 +1888,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
* Returns with the queue unlocked.
*/
blk_queue_enter_live(q);
- req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
+ req = get_request(q, bio->bi_opf, bio, 0);
if (IS_ERR(req)) {
blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 02fa42d24b52..9a85f6ad4f89 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -940,6 +940,9 @@ extern void blk_rq_init(struct request_queue *q, struct request *rq);
extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
extern void blk_put_request(struct request *);
extern void __blk_put_request(struct request_queue *, struct request *);
+extern struct request *blk_get_request_flags(struct request_queue *,
+ unsigned int op,
+ unsigned int flags);
extern struct request *blk_get_request(struct request_queue *, unsigned int op,
gfp_t gfp_mask);
extern void blk_requeue_request(struct request_queue *, struct request *);
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 6/9] block: Introduce BLK_MQ_REQ_PREEMPT
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
` (4 preceding siblings ...)
2017-10-05 0:01 ` [PATCH v6 5/9] block: Introduce blk_get_request_flags() Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests Bart Van Assche
` (2 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Hannes Reinecke, Johannes Thumshirn
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
block/blk-core.c | 4 +++-
block/blk-mq.c | 2 ++
include/linux/blk-mq.h | 1 +
3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 03156d9ede94..842e67e0b7ab 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1260,6 +1260,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
blk_rq_set_rl(rq, rl);
rq->cmd_flags = op;
rq->rq_flags = rq_flags;
+ if (flags & BLK_MQ_REQ_PREEMPT)
+ rq->rq_flags |= RQF_PREEMPT;
/* init elvpriv */
if (rq_flags & RQF_ELVPRIV) {
@@ -1441,7 +1443,7 @@ struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
struct request *req;
WARN_ON_ONCE(op & REQ_NOWAIT);
- WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT);
+ WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
if (q->mq_ops) {
req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ebf85e270509..271657992d1a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -290,6 +290,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->q = data->q;
rq->mq_ctx = data->ctx;
rq->cmd_flags = op;
+ if (data->flags & BLK_MQ_REQ_PREEMPT)
+ rq->rq_flags |= RQF_PREEMPT;
if (blk_queue_io_stat(data->q))
rq->rq_flags |= RQF_IO_STAT;
/* do not touch atomic flags, it needs atomic ops against the timer */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..ee9fc3fb7fca 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -200,6 +200,7 @@ enum {
BLK_MQ_REQ_NOWAIT = (1 << 0), /* return when out of requests */
BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */
BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */
+ BLK_MQ_REQ_PREEMPT = (1 << 3), /* set RQF_PREEMPT */
};
struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
` (5 preceding siblings ...)
2017-10-05 0:01 ` [PATCH v6 6/9] block: Introduce BLK_MQ_REQ_PREEMPT Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:42 ` David Miller
2017-10-05 0:01 ` [PATCH v6 8/9] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably Bart Van Assche
8 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, David S. Miller, Hannes Reinecke,
Johannes Thumshirn
Convert blk_get_request(q, op, __GFP_RECLAIM) into blk_get_request_flags(q,
op, BLK_MQ_REQ_PREEMPT). This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
drivers/ide/ide-pm.c | 4 ++--
drivers/scsi/scsi_lib.c | 6 +++---
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f56d742908df 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
}
memset(&rqpm, 0, sizeof(rqpm));
- rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+ rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
+ BLK_MQ_REQ_PREEMPT);
ide_req(rq)->type = ATA_PRIV_PM_RESUME;
- rq->rq_flags |= RQF_PREEMPT;
rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..1c16a247fae6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
- req = blk_get_request(sdev->request_queue,
+ req = blk_get_request_flags(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
- REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+ REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
- req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+ req->rq_flags |= rq_flags | RQF_QUIET;
/*
* head injection *required* here otherwise quiesce won't work
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 8/9] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
` (6 preceding siblings ...)
2017-10-05 0:01 ` [PATCH v6 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably Bart Van Assche
8 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Hannes Reinecke, Johannes Thumshirn
This flag will be used in the next patch to let the block layer core know
whether or not a SCSI request queue has been quiesced, since a quiesced SCSI
queue only processes RQF_PREEMPT requests.
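
A minimal usage sketch of the new helpers (the real callers are only added
in the next patch; q is assumed to point to the SCSI device's request
queue):

if (blk_set_preempt_only(q))
	pr_debug("PREEMPT_ONLY was already set\n");

/* From here on only BLK_MQ_REQ_PREEMPT request allocations are admitted. */

blk_clear_preempt_only(q);	/* resume normal request allocation */
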
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
block/blk-core.c | 29 +++++++++++++++++++++++++++++
include/linux/blkdev.h | 6 ++++++
2 files changed, 35 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index 842e67e0b7ab..b8d90fc29b35 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,35 @@ void blk_sync_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_sync_queue);
+/**
+ * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY.
+ *
+ * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
+ * set and 1 if the flag was already set.
+ */
+int blk_set_preempt_only(struct request_queue *q)
+{
+ unsigned long flags;
+ int res;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return res;
+}
+EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+
+void blk_clear_preempt_only(struct request_queue *q)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+
/**
* __blk_run_queue_uncond - run a queue whether or not it has been stopped
* @q: The queue to run
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9a85f6ad4f89..c4174f78c6eb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -631,6 +631,7 @@ struct request_queue {
#define QUEUE_FLAG_REGISTERED 26 /* queue has been registered to a disk */
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
@@ -735,6 +736,11 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
REQ_FAILFAST_DRIVER))
#define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q) \
+ test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern int blk_set_preempt_only(struct request_queue *q);
+extern void blk_clear_preempt_only(struct request_queue *q);
static inline bool blk_account_rq(struct request *rq)
{
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
` (7 preceding siblings ...)
2017-10-05 0:01 ` [PATCH v6 8/9] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag Bart Van Assche
@ 2017-10-05 0:01 ` Bart Van Assche
2017-10-07 4:33 ` Ming Lei
8 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2017-10-05 0:01 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Ming Lei,
Bart Van Assche, Hannes Reinecke, Johannes Thumshirn
It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI devices
are quiesced while the hibernation image is being written or restored. The
SCSI core quiesces devices through scsi_device_quiesce() and
scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt
requests is deferred. This is realized by returning BLKPREP_DEFER from
inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a full
queue prevents power management requests from being submitted by deferring
allocation of non-preempt requests for devices in the quiesced state. This
patch has been tested by running the following commands and by verifying
that after resume the fio job is still running:
for d in /sys/class/block/sd*[a-z]; do
hcil=$(readlink "$d/device")
hcil=${hcil#../../../}
echo 4 > "$d/queue/nr_requests"
echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
--ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
--loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid
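
For reference, the quiesce path introduced below boils down to the following
sequence (a condensed sketch of the scsi_device_quiesce() hunk; error
handling and the SDEV_QUIESCE state transition are omitted):

blk_mq_freeze_queue(q);			/* wait for in-flight requests to finish */
if (blk_set_preempt_only(q)) {		/* queue was already preempt-only */
	blk_mq_unfreeze_queue(q);
	return -EINVAL;
}
synchronize_rcu();			/* make the flag visible to percpu_ref_tryget() callers */
blk_mq_unfreeze_queue(q);		/* non-preempt allocations now wait in blk_queue_enter() */
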
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
block/blk-core.c | 38 ++++++++++++++++++++++++++++++--------
block/blk-mq.c | 4 ++--
block/blk-timeout.c | 2 +-
drivers/scsi/scsi_lib.c | 27 +++++++++++++++++++--------
fs/block_dev.c | 4 ++--
include/linux/blkdev.h | 2 +-
6 files changed, 55 insertions(+), 22 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b8d90fc29b35..81a4bb119d50 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -371,6 +371,7 @@ void blk_clear_preempt_only(struct request_queue *q)
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+ wake_up_all(&q->mq_freeze_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
}
EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -792,15 +793,34 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
}
EXPORT_SYMBOL(blk_alloc_queue);
-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ */
+int blk_queue_enter(struct request_queue *q, unsigned int flags)
{
+ const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+
while (true) {
int ret;
- if (percpu_ref_tryget_live(&q->q_usage_counter))
- return 0;
+ if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+ /*
+ * The code that sets the PREEMPT_ONLY flag is
+ * responsible for ensuring that that flag is globally
+ * visible before the queue is unfrozen.
+ */
+ if (preempt || !blk_queue_preempt_only(q)) {
+ return 0;
+ } else {
+ percpu_ref_put(&q->q_usage_counter);
+ WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
+ kobject_name(q->kobj.parent));
+ }
+ }
- if (nowait)
+ if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
/*
@@ -813,7 +833,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
smp_rmb();
ret = wait_event_interruptible(q->mq_freeze_wq,
- !atomic_read(&q->mq_freeze_depth) ||
+ (atomic_read(&q->mq_freeze_depth) == 0 &&
+ (preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
@@ -1441,8 +1462,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
- ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
- (op & REQ_NOWAIT));
+ ret = blk_queue_enter(q, flags);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -2263,8 +2283,10 @@ blk_qc_t generic_make_request(struct bio *bio)
current->bio_list = bio_list_on_stack;
do {
struct request_queue *q = bio->bi_disk->queue;
+ unsigned int flags = bio->bi_opf & REQ_NOWAIT ?
+ BLK_MQ_REQ_NOWAIT : 0;
- if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+ if (likely(blk_queue_enter(q, flags) == 0)) {
struct bio_list lower, same;
/* Create a fresh bio_list for all subordinate requests */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 271657992d1a..1604bc2d4a57 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -386,7 +386,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
struct request *rq;
int ret;
- ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+ ret = blk_queue_enter(q, flags);
if (ret)
return ERR_PTR(ret);
@@ -425,7 +425,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
if (hctx_idx >= q->nr_hw_queues)
return ERR_PTR(-EIO);
- ret = blk_queue_enter(q, true);
+ ret = blk_queue_enter(q, flags);
if (ret)
return ERR_PTR(ret);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 17ec83bb0900..b75d975cc5a5 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
struct request *rq, *tmp;
int next_set = 0;
- if (blk_queue_enter(q, true))
+ if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT))
return;
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 1c16a247fae6..0ba7af5debc7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2926,21 +2926,30 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
int
scsi_device_quiesce(struct scsi_device *sdev)
{
+ struct request_queue *q = sdev->request_queue;
int err;
+ blk_mq_freeze_queue(q);
+ if (blk_set_preempt_only(q)) {
+ blk_mq_unfreeze_queue(q);
+ return -EINVAL;
+ }
+ /*
+ * Ensure that the effect of blk_set_preempt_only() will be visible
+ * for percpu_ref_tryget() callers that occur after the queue
+ * unfreeze. See also https://lwn.net/Articles/573497/.
+ */
+ synchronize_rcu();
+ blk_mq_unfreeze_queue(q);
+
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
mutex_unlock(&sdev->state_mutex);
if (err)
- return err;
+ blk_clear_preempt_only(q);
- scsi_run_queue(sdev->request_queue);
- while (atomic_read(&sdev->device_busy)) {
- msleep_interruptible(200);
- scsi_run_queue(sdev->request_queue);
- }
- return 0;
+ return err;
}
EXPORT_SYMBOL(scsi_device_quiesce);
@@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev)
*/
mutex_lock(&sdev->state_mutex);
if (sdev->sdev_state == SDEV_QUIESCE &&
- scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
+ scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
+ blk_clear_preempt_only(sdev->request_queue);
scsi_run_queue(sdev->request_queue);
+ }
mutex_unlock(&sdev->state_mutex);
}
EXPORT_SYMBOL(scsi_device_resume);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..98cf2d7ee9d3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return result;
- result = blk_queue_enter(bdev->bd_queue, false);
+ result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
- result = blk_queue_enter(bdev->bd_queue, false);
+ result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c4174f78c6eb..9f21a8bbacc6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -972,7 +972,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
-extern int blk_queue_enter(struct request_queue *q, bool nowait);
+extern int blk_queue_enter(struct request_queue *q, unsigned int flags);
extern void blk_queue_exit(struct request_queue *q);
extern void blk_start_queue(struct request_queue *q);
extern void blk_start_queue_async(struct request_queue *q);
--
2.14.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v6 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests
2017-10-05 0:01 ` [PATCH v6 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests Bart Van Assche
@ 2017-10-05 0:42 ` David Miller
0 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2017-10-05 0:42 UTC (permalink / raw)
To: bart.vanassche
Cc: axboe, linux-block, hch, martin.petersen, oleksandr, mcgrof,
ming.lei, hare, jthumshirn
From: Bart Van Assche <bart.vanassche@wdc.com>
Date: Wed, 4 Oct 2017 17:01:08 -0700
> Convert blk_get_request(q, op, __GFP_RECLAIM) into blk_get_request_flags(q,
> op, BLK_MQ_REQ_PREEMPT). This patch does not change any functionality.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
For IDE:
Acked-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably
2017-10-05 0:01 ` [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably Bart Van Assche
@ 2017-10-07 4:33 ` Ming Lei
2017-10-09 21:16 ` Bart Van Assche
0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2017-10-07 4:33 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K. Petersen,
Oleksandr Natalenko, Luis R. Rodriguez, Hannes Reinecke,
Johannes Thumshirn
On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote:
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why while
> the hibernation image is being written or restored that SCSI devices
quiesce isn't used only for suspend and resume, And the issue isn't
suspend/resume specific too. So please change the title/commit log
as sort of 'make SCSI quiesce more reliable/safe'.
> are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
> and scsi_device_resume(). In the SDEV_QUIESCE state execution of
> non-preempt requests is deferred. This is realized by returning
> BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
> devices. Avoid that a full queue prevents power management requests
> to be submitted by deferring allocation of non-preempt requests for
> devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
>
> for d in /sys/class/block/sd*[a-z]; do
> hcil=$(readlink "$d/device")
> hcil=${hcil#../../../}
> echo 4 > "$d/queue/nr_requests"
> echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
> --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
> --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
>
> Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
> block/blk-core.c | 38 ++++++++++++++++++++++++++++++--------
> block/blk-mq.c | 4 ++--
> block/blk-timeout.c | 2 +-
> drivers/scsi/scsi_lib.c | 27 +++++++++++++++++++--------
> fs/block_dev.c | 4 ++--
> include/linux/blkdev.h | 2 +-
> 6 files changed, 55 insertions(+), 22 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b8d90fc29b35..81a4bb119d50 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -371,6 +371,7 @@ void blk_clear_preempt_only(struct request_queue *q)
>
> spin_lock_irqsave(q->queue_lock, flags);
> queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> + wake_up_all(&q->mq_freeze_wq);
> spin_unlock_irqrestore(q->queue_lock, flags);
> }
> EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
> @@ -792,15 +793,34 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
> }
> EXPORT_SYMBOL(blk_alloc_queue);
>
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + */
> +int blk_queue_enter(struct request_queue *q, unsigned int flags)
> {
> + const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> +
> while (true) {
> int ret;
>
> - if (percpu_ref_tryget_live(&q->q_usage_counter))
> - return 0;
> + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> + /*
> + * The code that sets the PREEMPT_ONLY flag is
> + * responsible for ensuring that that flag is globally
> + * visible before the queue is unfrozen.
> + */
> + if (preempt || !blk_queue_preempt_only(q)) {
PREEMPT_ONLY flag is checked without RCU read lock held, so the
synchronize_rcu() may just wait for completion of pre-exit
percpu_ref_tryget_live(), which can be reordered with the
reading on blk_queue_preempt_only().
> + return 0;
> + } else {
> + percpu_ref_put(&q->q_usage_counter);
> + WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
> + kobject_name(q->kobj.parent));
> + }
> + }
>
> - if (nowait)
> + if (flags & BLK_MQ_REQ_NOWAIT)
> return -EBUSY;
>
> /*
> @@ -813,7 +833,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
> smp_rmb();
>
> ret = wait_event_interruptible(q->mq_freeze_wq,
> - !atomic_read(&q->mq_freeze_depth) ||
> + (atomic_read(&q->mq_freeze_depth) == 0 &&
> + (preempt || !blk_queue_preempt_only(q))) ||
> blk_queue_dying(q));
> if (blk_queue_dying(q))
> return -ENODEV;
> @@ -1441,8 +1462,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
> /* create ioc upfront */
> create_io_context(gfp_mask, q->node);
>
> - ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
> - (op & REQ_NOWAIT));
> + ret = blk_queue_enter(q, flags);
> if (ret)
> return ERR_PTR(ret);
> spin_lock_irq(q->queue_lock);
> @@ -2263,8 +2283,10 @@ blk_qc_t generic_make_request(struct bio *bio)
> current->bio_list = bio_list_on_stack;
> do {
> struct request_queue *q = bio->bi_disk->queue;
> + unsigned int flags = bio->bi_opf & REQ_NOWAIT ?
> + BLK_MQ_REQ_NOWAIT : 0;
>
> - if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
> + if (likely(blk_queue_enter(q, flags) == 0)) {
> struct bio_list lower, same;
>
> /* Create a fresh bio_list for all subordinate requests */
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 271657992d1a..1604bc2d4a57 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -386,7 +386,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
> struct request *rq;
> int ret;
>
> - ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
> + ret = blk_queue_enter(q, flags);
> if (ret)
> return ERR_PTR(ret);
>
> @@ -425,7 +425,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> if (hctx_idx >= q->nr_hw_queues)
> return ERR_PTR(-EIO);
>
> - ret = blk_queue_enter(q, true);
> + ret = blk_queue_enter(q, flags);
> if (ret)
> return ERR_PTR(ret);
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index 17ec83bb0900..b75d975cc5a5 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
> struct request *rq, *tmp;
> int next_set = 0;
>
> - if (blk_queue_enter(q, true))
> + if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT))
> return;
> spin_lock_irqsave(q->queue_lock, flags);
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 1c16a247fae6..0ba7af5debc7 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2926,21 +2926,30 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
> int
> scsi_device_quiesce(struct scsi_device *sdev)
> {
> + struct request_queue *q = sdev->request_queue;
> int err;
>
> + blk_mq_freeze_queue(q);
> + if (blk_set_preempt_only(q)) {
> + blk_mq_unfreeze_queue(q);
> + return -EINVAL;
> + }
This way is wrong, if blk_set_preempt_only() returns true
it means the queue has been in PREEMPT_ONLY already,
and failing scsi_device_quiesce() can break suspend/resume or
sending SCSI domain validation command.
The reasonable handling should be just going ahead if queue
is in PREEMPT_ONLY already.
> + /*
> + * Ensure that the effect of blk_set_preempt_only() will be visible
> + * for percpu_ref_tryget() callers that occur after the queue
> + * unfreeze. See also https://lwn.net/Articles/573497/.
> + */
> + synchronize_rcu();
This synchronize_rcu may be saved if we set the PREEMPT_ONLY flag
before freezing queue since blk_mq_freeze_queue() may implicate
one synchronize_rcu().
> + blk_mq_unfreeze_queue(q);
> +
> mutex_lock(&sdev->state_mutex);
> err = scsi_device_set_state(sdev, SDEV_QUIESCE);
> mutex_unlock(&sdev->state_mutex);
>
> if (err)
> - return err;
> + blk_clear_preempt_only(q);
>
> - scsi_run_queue(sdev->request_queue);
> - while (atomic_read(&sdev->device_busy)) {
> - msleep_interruptible(200);
> - scsi_run_queue(sdev->request_queue);
> - }
> - return 0;
> + return err;
> }
> EXPORT_SYMBOL(scsi_device_quiesce);
>
> @@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev)
> */
> mutex_lock(&sdev->state_mutex);
> if (sdev->sdev_state == SDEV_QUIESCE &&
> - scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
> + scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
> + blk_clear_preempt_only(sdev->request_queue);
> scsi_run_queue(sdev->request_queue);
> + }
> mutex_unlock(&sdev->state_mutex);
scsi_run_queue() can be removed, and blk_clear_preempt_only() needn't
to be run with holding sdev->state_mutex, just like in quiesce path.
--
Ming
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably
2017-10-07 4:33 ` Ming Lei
@ 2017-10-09 21:16 ` Bart Van Assche
2017-10-09 21:42 ` Bart Van Assche
2017-10-10 9:50 ` Ming Lei
0 siblings, 2 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-09 21:16 UTC (permalink / raw)
To: ming.lei@redhat.com
Cc: jthumshirn@suse.de, linux-block@vger.kernel.org, hch@lst.de,
martin.petersen@oracle.com, axboe@kernel.dk,
oleksandr@natalenko.name, hare@suse.com, mcgrof@kernel.org
On Sat, 2017-10-07 at 12:33 +0800, Ming Lei wrote:
> On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote:
> > It is essential during suspend and resume that neither the filesystem
> > state nor the filesystem metadata in RAM changes. This is why while
> > the hibernation image is being written or restored that SCSI devices
> 
> quiesce isn't used only for suspend and resume, And the issue isn't
> suspend/resume specific too. So please change the title/commit log
> as sort of 'make SCSI quiesce more reliable/safe'.

OK, I will modify the patch title.

> > -		if (percpu_ref_tryget_live(&q->q_usage_counter))
> > -			return 0;
> > +		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > +			/*
> > +			 * The code that sets the PREEMPT_ONLY flag is
> > +			 * responsible for ensuring that that flag is globally
> > +			 * visible before the queue is unfrozen.
> > +			 */
> > +			if (preempt || !blk_queue_preempt_only(q)) {
> 
> PREEMPT_ONLY flag is checked without RCU read lock held, so the
> synchronize_rcu() may just wait for completion of pre-exit
> percpu_ref_tryget_live(), which can be reordered with the
> reading on blk_queue_preempt_only().

OK, I will add rcu_read_lock_sched/unlock_sched() calls around this code.

> >  int
> >  scsi_device_quiesce(struct scsi_device *sdev)
> >  {
> > +	struct request_queue *q = sdev->request_queue;
> >  	int err;
> >  
> > +	blk_mq_freeze_queue(q);
> > +	if (blk_set_preempt_only(q)) {
> > +		blk_mq_unfreeze_queue(q);
> > +		return -EINVAL;
> > +	}
> 
> This way is wrong, if blk_set_preempt_only() returns true
> it means the queue has been in PREEMPT_ONLY already,
> and failing scsi_device_quiesce() can break suspend/resume or
> sending SCSI domain validation command.

That's an interesting comment but I propose that what you suggest is implemented
through a separate patch. The above code preserves the existing behavior, namely
that scsi_device_quiesce() returns an error code if called when a SCSI device has
already been quiesced. See also scsi_device_set_state(). BTW, I don't think that
it is likely that this will occur. The only code other than the power management
code I know of that sets the SCSI QUIESCE state is the SCSI sysfs state store
method and the SCSI parallel code. I don't know any software that sets the SCSI
QUIESCE state through sysfs. And I'm not sure the SCSI parallel code has a
significant number of users.

> > +	/*
> > +	 * Ensure that the effect of blk_set_preempt_only() will be visible
> > +	 * for percpu_ref_tryget() callers that occur after the queue
> > +	 * unfreeze. See also https://lwn.net/Articles/573497/.
> > +	 */
> > +	synchronize_rcu();
> 
> This synchronize_rcu may be saved if we set the PREEMPT_ONLY flag
> before freezing queue since blk_mq_freeze_queue() may implicate
> one synchronize_rcu().

That's an interesting comment. I will look further into this.

> > @@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev)
> >  	 */
> >  	mutex_lock(&sdev->state_mutex);
> >  	if (sdev->sdev_state == SDEV_QUIESCE &&
> > -	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
> > +	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
> > +		blk_clear_preempt_only(sdev->request_queue);
> >  		scsi_run_queue(sdev->request_queue);
> > +	}
> >  	mutex_unlock(&sdev->state_mutex);
> 
> scsi_run_queue() can be removed, and blk_clear_preempt_only() needn't
> to be run with holding sdev->state_mutex, just like in quiesce path.

I will look into removing the scsi_run_queue() call. But I prefer to keep
the blk_clear_preempt_only() inside the critical section because that call
doesn't sleep and because it changes state information that is related to
the SCSI state.

Bart.
* Re: [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably
2017-10-09 21:16 ` Bart Van Assche
@ 2017-10-09 21:42 ` Bart Van Assche
2017-10-10 9:50 ` Ming Lei
1 sibling, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2017-10-09 21:42 UTC (permalink / raw)
To: ming.lei@redhat.com
Cc: jthumshirn@suse.de, linux-block@vger.kernel.org, hch@lst.de,
martin.petersen@oracle.com, axboe@kernel.dk,
oleksandr@natalenko.name, hare@suse.com, mcgrof@kernel.org
On Mon, 2017-10-09 at 14:16 -0700, Bart Van Assche wrote:
> On Sat, 2017-10-07 at 12:33 +0800, Ming Lei wrote:
> > On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote:
> > >  int
> > >  scsi_device_quiesce(struct scsi_device *sdev)
> > >  {
> > > +	struct request_queue *q = sdev->request_queue;
> > >  	int err;
> > >  
> > > +	blk_mq_freeze_queue(q);
> > > +	if (blk_set_preempt_only(q)) {
> > > +		blk_mq_unfreeze_queue(q);
> > > +		return -EINVAL;
> > > +	}
> > 
> > This way is wrong, if blk_set_preempt_only() returns true
> > it means the queue has been in PREEMPT_ONLY already,
> > and failing scsi_device_quiesce() can break suspend/resume or
> > sending SCSI domain validation command.
> 
> That's an interesting comment but I propose that what you suggest is implemented
> through a separate patch. The above code preserves the existing behavior, namely
> that scsi_device_quiesce() returns an error code if called when a SCSI device has
> already been quiesced. See also scsi_device_set_state(). BTW, I don't think that
> it is likely that this will occur. The only code other than the power management
> code I know of that sets the SCSI QUIESCE state is the SCSI sysfs state store
> method and the SCSI parallel code. I don't know any software that sets the SCSI
> QUIESCE state through sysfs. And I'm not sure the SCSI parallel code has a
> significant number of users.

A correction: the current behavior when quiescing an already quiesced device is
that scsi_device_quiesce() returns 0 without changing any state. Anyway, I propose
to keep that behavior and not to add more complexity now.

Bart.
* Re: [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably
2017-10-09 21:16 ` Bart Van Assche
2017-10-09 21:42 ` Bart Van Assche
@ 2017-10-10 9:50 ` Ming Lei
1 sibling, 0 replies; 15+ messages in thread
From: Ming Lei @ 2017-10-10 9:50 UTC (permalink / raw)
To: Bart Van Assche
Cc: jthumshirn@suse.de, linux-block@vger.kernel.org, hch@lst.de,
martin.petersen@oracle.com, axboe@kernel.dk,
oleksandr@natalenko.name, hare@suse.com, mcgrof@kernel.org
On Mon, Oct 09, 2017 at 09:16:53PM +0000, Bart Van Assche wrote:
> On Sat, 2017-10-07 at 12:33 +0800, Ming Lei wrote:
> > On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote:
> > > It is essential during suspend and resume that neither the filesystem
> > > state nor the filesystem metadata in RAM changes. This is why while
> > > the hibernation image is being written or restored that SCSI devices
> >
> > quiesce isn't used only for suspend and resume, and the issue isn't
> > suspend/resume specific either. So please change the title/commit log
> > to something like 'make SCSI quiesce more reliable/safe'.
>
> OK, I will modify the patch title.
>
> > > - if (percpu_ref_tryget_live(&q->q_usage_counter))
> > > - return 0;
> > > + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > > + /*
> > > + * The code that sets the PREEMPT_ONLY flag is
> > > + * responsible for ensuring that that flag is globally
> > > + * visible before the queue is unfrozen.
> > > + */
> > > + if (preempt || !blk_queue_preempt_only(q)) {
> >
> > PREEMPT_ONLY flag is checked without RCU read lock held, so the
> > synchronize_rcu() may just wait for completion of pre-exit
> > percpu_ref_tryget_live(), which can be reordered with the
> > reading on blk_queue_preempt_only().
>
> OK, I will add rcu_read_lock_sched/unlock_sched() calls around this code.
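For concreteness, a minimal sketch of what that fast path could look like with
the RCU-sched read-side critical section added (illustration only, not the
actual patch; the local `preempt` flag is assumed to be derived from
BLK_MQ_REQ_PREEMPT):

	bool success = false;

	rcu_read_lock_sched();
	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
		/*
		 * Test the PREEMPT_ONLY flag inside the same RCU-sched
		 * read-side critical section as the tryget, so that a
		 * grace period started after blk_set_preempt_only()
		 * cannot end between the two reads.
		 */
		if (preempt || !blk_queue_preempt_only(q))
			success = true;
		else
			percpu_ref_put(&q->q_usage_counter);
	}
	rcu_read_unlock_sched();

	if (success)
		return 0;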
>
> > > int
> > > scsi_device_quiesce(struct scsi_device *sdev)
> > > {
> > > + struct request_queue *q = sdev->request_queue;
> > > int err;
> > >
> > > + blk_mq_freeze_queue(q);
> > > + if (blk_set_preempt_only(q)) {
> > > + blk_mq_unfreeze_queue(q);
> > > + return -EINVAL;
> > > + }
> >
> > This way is wrong, if blk_set_preempt_only() returns true
> > it means the queue has been in PREEMPT_ONLY already,
> > and failing scsi_device_quiesce() can break suspend/resume or
> > sending SCSI domain validation command.
>
> That's an interesting comment but I propose that what you suggest is implemented
> through a separate patch. The above code preserves the existing behavior, namely
> that scsi_device_quiesce() returns an error code if called when a SCSI device has
> already been quiesced. See also scsi_device_set_state(). BTW, I don't think that
Please see scsi_device_set_state():

	if (state == oldstate)
		return 0;

And, believe it or not, there is a real nested call of scsi_device_quiesce()
in transport_spi.
> it is likely that this will occur. The only code other than the power management
> code I know of that sets the SCSI QUIESCE state is the SCSI sysfs state store
> method and the SCSI parallel code. I don't know any software that sets the SCSI
> QUIESCE state through sysfs. And I'm not sure the SCSI parallel code has a
> significant number of users.
As far as I know, one famous VM product uses that, and the original report from
our customer is that this VM may hang after running I/O for days.
The revalidate path can trigger QUIESCE, and udev is likely what triggers it.
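For what it's worth, a minimal sketch of how a nested quiesce could be
tolerated rather than failed (illustration only; returning 0 here would mirror
the state == oldstate short-circuit in scsi_device_set_state()):

	blk_mq_freeze_queue(q);
	if (blk_set_preempt_only(q)) {
		/*
		 * The queue is already preempt-only, i.e. this device has
		 * already been quiesced (e.g. by SPI domain validation).
		 * Treat the nested quiesce as a no-op instead of failing.
		 */
		blk_mq_unfreeze_queue(q);
		return 0;
	}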
>
> > > + /*
> > > + * Ensure that the effect of blk_set_preempt_only() will be visible
> > > + * for percpu_ref_tryget() callers that occur after the queue
> > > + * unfreeze. See also https://lwn.net/Articles/573497/.
> > > + */
> > > + synchronize_rcu();
> >
> > This synchronize_rcu may be saved if we set the PREEMPT_ONLY flag
> > before freezing queue since blk_mq_freeze_queue() may implicate
> > one synchronize_rcu().
>
> That's an interesting comment. I will look further into this.
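For reference, a sketch of the ordering suggested above (illustration only;
whether the explicit synchronize_rcu() can really be dropped rests on the
assumption that blk_mq_freeze_queue() always implies an RCU grace period):

	/* Publish PREEMPT_ONLY before freezing instead of after. */
	blk_set_preempt_only(q);

	/*
	 * blk_mq_freeze_queue() drains q_usage_counter and, for blk-mq,
	 * may already imply an RCU grace period, which is what would make
	 * the explicit synchronize_rcu() call redundant.
	 */
	blk_mq_freeze_queue(q);

	/* ... rest of scsi_device_quiesce() unchanged ... */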
>
> > > @@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev)
> > > */
> > > mutex_lock(&sdev->state_mutex);
> > > if (sdev->sdev_state == SDEV_QUIESCE &&
> > > - scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
> > > + scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
> > > + blk_clear_preempt_only(sdev->request_queue);
> > > scsi_run_queue(sdev->request_queue);
> > > + }
> > > mutex_unlock(&sdev->state_mutex);
> >
> > scsi_run_queue() can be removed, and blk_clear_preempt_only() needn't
> > to be run with holding sdev->state_mutex, just like in quiesce path.
>
> I will look into removing the scsi_run_queue() call. But I prefer to keep
> the blk_clear_preempt_only() inside the critical section because that call
> doesn't sleep and because it changes state information that is related to
> the SCSI state.
The thing is that the matching state change in blk_set_preempt_only() can't
be protected by that lock, and this asymmetry may confuse people about the
inconsistent locking used to protect the same state.
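For concreteness, the shape this suggestion seems to imply (illustration only;
whether dropping scsi_run_queue() is safe depends on blk_clear_preempt_only()
waking up blocked blk_queue_enter() callers, which is not shown here):

void scsi_device_resume(struct scsi_device *sdev)
{
	/*
	 * Clear the flag without holding sdev->state_mutex, mirroring how
	 * blk_set_preempt_only() is called in the quiesce path.
	 */
	blk_clear_preempt_only(sdev->request_queue);

	mutex_lock(&sdev->state_mutex);
	if (sdev->sdev_state == SDEV_QUIESCE)
		scsi_device_set_state(sdev, SDEV_RUNNING);
	mutex_unlock(&sdev->state_mutex);
}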
--
Ming
end of thread (newest message: 2017-10-10 9:50 UTC)
Thread overview: 15+ messages
2017-10-05 0:01 [PATCH v6 0/9] block, scsi, md: Improve suspend and resume Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 1/9] md: Rename md_notifier into md_reboot_notifier Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 2/9] md: Introduce md_stop_all_writes() Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 3/9] md: Neither resync nor reshape while the system is frozen Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 4/9] block: Make q_usage_counter also track legacy requests Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 5/9] block: Introduce blk_get_request_flags() Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 6/9] block: Introduce BLK_MQ_REQ_PREEMPT Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests Bart Van Assche
2017-10-05 0:42 ` David Miller
2017-10-05 0:01 ` [PATCH v6 8/9] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag Bart Van Assche
2017-10-05 0:01 ` [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably Bart Van Assche
2017-10-07 4:33 ` Ming Lei
2017-10-09 21:16 ` Bart Van Assche
2017-10-09 21:42 ` Bart Van Assche
2017-10-10 9:50 ` Ming Lei