* [PATCH 0/9] dm patches for kernel v4.9
From: Bart Van Assche @ 2016-08-31 22:14 UTC
To: Mike Snitzer; +Cc: device-mapper development
0 siblings, 9 replies; 45+ messages in thread

Hello Mike,

This patch series fixes sporadic hangs that occur when removing a dm
device. I would appreciate it if you would consider these patches for
inclusion in kernel v4.9. The patches in this series are:

0001-blk-mq-Introduce-blk_mq_queue_stopped.patch
0002-dm-Rename-a-function-argument.patch
0003-dm-Introduce-signal_pending_state.patch
0004-dm-Convert-wait-loops.patch
0005-dm-Add-two-lockdep_assert_held-statements.patch
0006-dm-Simplify-dm_old_stop_queue.patch
0007-dm-Mark-block-layer-queue-dead-before-destroying-the.patch
0008-dm-Fix-two-race-conditions-related-to-stopping-and-s.patch
0009-dm-path-selector-Avoid-that-device-removal-triggers-.patch

Thanks,

Bart.
* [PATCH 1/9] blk-mq: Introduce blk_mq_queue_stopped()
From: Bart Van Assche @ 2016-08-31 22:15 UTC
To: Mike Snitzer; +Cc: Jens Axboe, device-mapper development, Christoph Hellwig

The function blk_queue_stopped() makes it possible to test whether or
not a non-mq queue has been stopped. Introduce a helper function that
lets block drivers easily query whether one or more hardware contexts
of a blk-mq queue have been stopped.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 17 +++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dcafa6..6929a52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -965,6 +965,23 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 }
 EXPORT_SYMBOL(blk_mq_run_hw_queues);
 
+/*
+ * The caller is responsible for serializing this function against
+ * blk_mq_{start,stop}_hw_queue().
+ */
+bool blk_mq_queue_stopped(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL(blk_mq_queue_stopped);
+
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
 	cancel_delayed_work(&hctx->run_work);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e43bbff..bd677bc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -235,6 +235,7 @@ void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_abort_requeue_list(struct request_queue *q);
 void blk_mq_complete_request(struct request *rq, int error);
 
+bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
 void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
 void blk_mq_stop_hw_queues(struct request_queue *q);
-- 
2.9.3
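The new helper simply scans every hardware context of a queue for the
BLK_MQ_S_STOPPED bit and reports "stopped" if any context has it set.
A standalone userspace sketch of that logic follows; the struct names,
the `queue_stopped` function, and the `HCTX_STOPPED` bit are illustrative
stand-ins, not the real kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for struct blk_mq_hw_ctx / struct request_queue. */
enum { HCTX_STOPPED = 1u << 0 };	/* models BLK_MQ_S_STOPPED */

struct mock_hctx {
	unsigned long state;
};

struct mock_queue {
	struct mock_hctx hctx[4];	/* stand-in for q->queue_hw_ctx */
	size_t nr_hw_queues;
};

/* Mirrors blk_mq_queue_stopped(): true if ANY hardware context is stopped. */
static bool queue_stopped(const struct mock_queue *q)
{
	for (size_t i = 0; i < q->nr_hw_queues; i++)
		if (q->hctx[i].state & HCTX_STOPPED)
			return true;
	return false;
}
```

Note that, as the kernel comment above says, the result is only meaningful
if the caller serializes this check against start/stop of the hardware
queues; the sketch has the same property.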
* [PATCH 2/9] dm: Rename a function argument
From: Bart Van Assche @ 2016-08-31 22:16 UTC
To: Mike Snitzer; +Cc: device-mapper development

Rename 'interruptible' to 'sleep_state' to make it clear that this
argument is a task state rather than a boolean.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 drivers/md/dm.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fa9b1cb..1d3627c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1934,7 +1934,8 @@ void dm_put(struct mapped_device *md)
 }
 EXPORT_SYMBOL_GPL(dm_put);
 
-static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
+/* @sleep_state: e.g. TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE */
+static int dm_wait_for_completion(struct mapped_device *md, int sleep_state)
 {
 	int r = 0;
 	DECLARE_WAITQUEUE(wait, current);
@@ -1942,12 +1943,12 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	add_wait_queue(&md->wait, &wait);
 
 	while (1) {
-		set_current_state(interruptible);
+		set_current_state(sleep_state);
 
 		if (!md_in_flight(md))
 			break;
 
-		if (interruptible == TASK_INTERRUPTIBLE &&
+		if (sleep_state == TASK_INTERRUPTIBLE &&
 		    signal_pending(current)) {
 			r = -EINTR;
 			break;
@@ -2075,6 +2076,10 @@ static void unlock_fs(struct mapped_device *md)
 }
 
 /*
+ * @suspend_flags: DM_SUSPEND_LOCKFS_FLAG and/or DM_SUSPEND_NOFLUSH_FLAG
+ * @sleep_state: e.g. TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
+ * @dmf_suspended_flag: DMF_SUSPENDED or DMF_SUSPENDED_INTERNALLY
+ *
  * If __dm_suspend returns 0, the device is completely quiescent
  * now. There is no request-processing activity. All new requests
  * are being added to md->deferred list.
@@ -2082,7 +2087,7 @@ static void unlock_fs(struct mapped_device *md)
  * Caller must hold md->suspend_lock
  */
 static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
-			unsigned suspend_flags, int interruptible,
+			unsigned suspend_flags, int sleep_state,
 			int dmf_suspended_flag)
 {
 	bool do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG;
@@ -2149,7 +2154,7 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	 * We call dm_wait_for_completion to wait for all existing requests
 	 * to finish.
 	 */
-	r = dm_wait_for_completion(md, interruptible);
+	r = dm_wait_for_completion(md, sleep_state);
 	if (!r)
 		set_bit(dmf_suspended_flag, &md->flags);
 
-- 
2.9.3
* Re: [PATCH 2/9] dm: Rename a function argument 2016-08-31 22:16 ` [PATCH 2/9] dm: Rename a function argument Bart Van Assche @ 2016-09-01 3:29 ` Mike Snitzer 2016-09-01 14:17 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 3:29 UTC (permalink / raw) To: Bart Van Assche; +Cc: device-mapper development On Wed, Aug 31 2016 at 6:16pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > Rename 'interruptible' into 'sleep_state' to make it clear that this > argument is a task state instead of a boolean. > > Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Shouldn't the type also be changed from int to long (to match 'state' member from 'struct task_struct')? > --- > drivers/md/dm.c | 15 ++++++++++----- > 1 file changed, 10 insertions(+), 5 deletions(-) > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index fa9b1cb..1d3627c 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -1934,7 +1934,8 @@ void dm_put(struct mapped_device *md) > } > EXPORT_SYMBOL_GPL(dm_put); > > -static int dm_wait_for_completion(struct mapped_device *md, int interruptible) > +/* @sleep_state: e.g. TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE */ > +static int dm_wait_for_completion(struct mapped_device *md, int sleep_state) > { > int r = 0; > DECLARE_WAITQUEUE(wait, current); > @@ -1942,12 +1943,12 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible) > add_wait_queue(&md->wait, &wait); > > while (1) { > - set_current_state(interruptible); > + set_current_state(sleep_state); > > if (!md_in_flight(md)) > break; > > - if (interruptible == TASK_INTERRUPTIBLE && > + if (sleep_state == TASK_INTERRUPTIBLE && > signal_pending(current)) { > r = -EINTR; > break; > @@ -2075,6 +2076,10 @@ static void unlock_fs(struct mapped_device *md) > } > > /* > + * @suspend_flags: DM_SUSPEND_LOCKFS_FLAG and/or DM_SUSPEND_NOFLUSH_FLAG > + * @sleep_state: e.g. 
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE > + * @dmf_suspended_flag: DMF_SUSPENDED or DMF_SUSPENDED_INTERNALLY > + * > * If __dm_suspend returns 0, the device is completely quiescent > * now. There is no request-processing activity. All new requests > * are being added to md->deferred list. > @@ -2082,7 +2087,7 @@ static void unlock_fs(struct mapped_device *md) > * Caller must hold md->suspend_lock > */ > static int __dm_suspend(struct mapped_device *md, struct dm_table *map, > - unsigned suspend_flags, int interruptible, > + unsigned suspend_flags, int sleep_state, > int dmf_suspended_flag) > { > bool do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG; > @@ -2149,7 +2154,7 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map, > * We call dm_wait_for_completion to wait for all existing requests > * to finish. > */ > - r = dm_wait_for_completion(md, interruptible); > + r = dm_wait_for_completion(md, sleep_state); > if (!r) > set_bit(dmf_suspended_flag, &md->flags); > > -- > 2.9.3 > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/9] dm: Rename a function argument 2016-09-01 3:29 ` Mike Snitzer @ 2016-09-01 14:17 ` Bart Van Assche 0 siblings, 0 replies; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 14:17 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development On 08/31/16 20:29, Mike Snitzer wrote: > On Wed, Aug 31 2016 at 6:16pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >> Rename 'interruptible' into 'sleep_state' to make it clear that this >> argument is a task state instead of a boolean. > > Shouldn't the type also be changed from int to long (to match 'state' > member from 'struct task_struct')? Hello Mike, Today TASK_STATE_MAX equals 2048. So I think an int is big enough even on 32-bit platforms. Please let me know if you really want me to change the type from int to long. Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 3/9] dm: Introduce signal_pending_state()
From: Bart Van Assche @ 2016-08-31 22:16 UTC
To: Mike Snitzer; +Cc: device-mapper development

Use signal_pending_state() instead of open-coding it. This patch does
not change any functionality but makes it possible to pass TASK_KILLABLE
as the second argument of dm_wait_for_completion(). See also commit
16882c1e962b ("sched: fix TASK_WAKEKILL vs SIGKILL race").

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 drivers/md/dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1d3627c..7cce09e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1948,8 +1948,7 @@ static int dm_wait_for_completion(struct mapped_device *md, int sleep_state)
 		if (!md_in_flight(md))
 			break;
 
-		if (sleep_state == TASK_INTERRUPTIBLE &&
-		    signal_pending(current)) {
+		if (signal_pending_state(sleep_state, current)) {
 			r = -EINTR;
 			break;
 		}
-- 
2.9.3
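The point of signal_pending_state() is that a pending signal should only
interrupt a sleep whose task state permits it: any signal interrupts an
interruptible sleep, while a killable sleep is only interrupted by a
fatal signal. The open-coded TASK_INTERRUPTIBLE check could never honor
the killable case. A userspace sketch of that decision logic (mock state
bits and a `mock_signal_pending_state` helper, not the kernel's actual
definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative task-state bits, mirroring the kernel encoding in spirit. */
enum {
	STATE_INTERRUPTIBLE   = 1u << 0,	/* TASK_INTERRUPTIBLE   */
	STATE_UNINTERRUPTIBLE = 1u << 1,	/* TASK_UNINTERRUPTIBLE */
	STATE_WAKEKILL        = 1u << 7,	/* TASK_WAKEKILL        */
};
#define STATE_KILLABLE (STATE_WAKEKILL | STATE_UNINTERRUPTIBLE)

struct mock_task {
	bool signal_pending;		/* any signal queued    */
	bool fatal_signal_pending;	/* e.g. SIGKILL queued  */
};

/*
 * Sketch of signal_pending_state(): a pending signal interrupts the sleep
 * only if the state is interruptible, or if the state is killable and the
 * pending signal is fatal.
 */
static bool mock_signal_pending_state(unsigned state, const struct mock_task *t)
{
	if (!(state & (STATE_INTERRUPTIBLE | STATE_WAKEKILL)))
		return false;
	if (!t->signal_pending)
		return false;
	return (state & STATE_INTERRUPTIBLE) || t->fatal_signal_pending;
}
```

With this shape, passing a killable state into the wait loop does the
right thing automatically, which is exactly what the commit message says
this patch enables for dm_wait_for_completion().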
* [PATCH 4/9] dm: Convert wait loops 2016-08-31 22:14 [PATCH 0/9] dm patches for kernel v4.9 Bart Van Assche ` (2 preceding siblings ...) 2016-08-31 22:16 ` [PATCH 3/9] dm: Introduce signal_pending_state() Bart Van Assche @ 2016-08-31 22:16 ` Bart Van Assche 2016-08-31 22:17 ` [PATCH 5/9] dm: Add two lockdep_assert_held() statements Bart Van Assche ` (4 subsequent siblings) 8 siblings, 0 replies; 45+ messages in thread From: Bart Van Assche @ 2016-08-31 22:16 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development Use autoremove_wake_function() instead of default_wake_function() to make the dm wait loops more similar to other wait loops in the kernel. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> --- drivers/md/dm-mpath.c | 10 +++------- drivers/md/dm.c | 10 +++------- 2 files changed, 6 insertions(+), 14 deletions(-) diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index ac734e5..8df7b12 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -1193,21 +1193,17 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv) static void multipath_wait_for_pg_init_completion(struct multipath *m) { - DECLARE_WAITQUEUE(wait, current); - - add_wait_queue(&m->pg_init_wait, &wait); + DEFINE_WAIT(wait); while (1) { - set_current_state(TASK_UNINTERRUPTIBLE); + prepare_to_wait(&m->pg_init_wait, &wait, TASK_UNINTERRUPTIBLE); if (!atomic_read(&m->pg_init_in_progress)) break; io_schedule(); } - set_current_state(TASK_RUNNING); - - remove_wait_queue(&m->pg_init_wait, &wait); + finish_wait(&m->pg_init_wait, &wait); } static void flush_multipath_work(struct multipath *m) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 7cce09e..090ef31 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1938,12 +1938,10 @@ EXPORT_SYMBOL_GPL(dm_put); static int dm_wait_for_completion(struct mapped_device *md, int sleep_state) { int r = 0; - DECLARE_WAITQUEUE(wait, current); - - 
add_wait_queue(&md->wait, &wait); + DEFINE_WAIT(wait); while (1) { - set_current_state(sleep_state); + prepare_to_wait(&md->wait, &wait, sleep_state); if (!md_in_flight(md)) break; @@ -1955,9 +1953,7 @@ static int dm_wait_for_completion(struct mapped_device *md, int sleep_state) io_schedule(); } - set_current_state(TASK_RUNNING); - - remove_wait_queue(&md->wait, &wait); + finish_wait(&md->wait, &wait); return r; } -- 2.9.3 ^ permalink raw reply related [flat|nested] 45+ messages in thread
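The conversion above follows the kernel's canonical wait-loop shape:
prepare_to_wait() both queues the waiter (once) and sets the task state,
and finish_wait() restores TASK_RUNNING and dequeues. A userspace sketch
of that control flow, using mock primitives (`mock_prepare_to_wait`,
`mock_finish_wait`, and the surrounding structs are illustrative, not the
kernel API):

```c
#include <assert.h>
#include <stdbool.h>

enum mock_state { RUNNING, SLEEPING };

struct mock_waitqueue { int nr_waiters; };
struct mock_waiter   { bool queued; };

static enum mock_state cur_state = RUNNING;

/* Like prepare_to_wait(): enqueue once, then mark the task as sleeping. */
static void mock_prepare_to_wait(struct mock_waitqueue *wq, struct mock_waiter *w)
{
	if (!w->queued) {
		wq->nr_waiters++;
		w->queued = true;
	}
	cur_state = SLEEPING;
}

/* Like finish_wait(): back to running, and dequeue if still queued. */
static void mock_finish_wait(struct mock_waitqueue *wq, struct mock_waiter *w)
{
	cur_state = RUNNING;
	if (w->queued) {
		wq->nr_waiters--;
		w->queued = false;
	}
}

/* Shape of the converted dm_wait_for_completion()/pg-init wait loops. */
static int mock_wait_for(struct mock_waitqueue *wq, int *in_flight)
{
	struct mock_waiter wait = { false };

	while (1) {
		mock_prepare_to_wait(wq, &wait);
		if (*in_flight == 0)
			break;
		(*in_flight)--;	/* stands in for io_schedule() making progress */
	}
	mock_finish_wait(wq, &wait);
	return (cur_state == RUNNING && wq->nr_waiters == 0) ? 0 : -1;
}
```

The key property the real helpers share with this sketch is that the task
state is set *before* the condition is re-checked, which closes the
lost-wakeup window between the check and the sleep.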
* [PATCH 5/9] dm: Add two lockdep_assert_held() statements 2016-08-31 22:14 [PATCH 0/9] dm patches for kernel v4.9 Bart Van Assche ` (3 preceding siblings ...) 2016-08-31 22:16 ` [PATCH 4/9] dm: Convert wait loops Bart Van Assche @ 2016-08-31 22:17 ` Bart Van Assche 2016-08-31 22:17 ` [PATCH 6/9] dm: Simplify dm_old_stop_queue() Bart Van Assche ` (3 subsequent siblings) 8 siblings, 0 replies; 45+ messages in thread From: Bart Van Assche @ 2016-08-31 22:17 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development Document the locking assumptions for the __bind() and __dm_suspend() functions. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> --- drivers/md/dm.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 090ef31..0775411 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1648,6 +1648,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t, struct request_queue *q = md->queue; sector_t size; + lockdep_assert_held(&md->suspend_lock); + size = dm_table_get_size(t); /* @@ -2089,6 +2091,8 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map, bool noflush = suspend_flags & DM_SUSPEND_NOFLUSH_FLAG; int r; + lockdep_assert_held(&md->suspend_lock); + /* * DMF_NOFLUSH_SUSPENDING must be set before presuspend. * This flag is cleared before dm_suspend returns. -- 2.9.3 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 6/9] dm: Simplify dm_old_stop_queue() 2016-08-31 22:14 [PATCH 0/9] dm patches for kernel v4.9 Bart Van Assche ` (4 preceding siblings ...) 2016-08-31 22:17 ` [PATCH 5/9] dm: Add two lockdep_assert_held() statements Bart Van Assche @ 2016-08-31 22:17 ` Bart Van Assche 2016-08-31 22:17 ` [PATCH 7/9] dm: Mark block layer queue dead before destroying the dm device Bart Van Assche ` (2 subsequent siblings) 8 siblings, 0 replies; 45+ messages in thread From: Bart Van Assche @ 2016-08-31 22:17 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> --- drivers/md/dm-rq.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 1ca7463..8dc8cfb 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -89,12 +89,8 @@ static void dm_old_stop_queue(struct request_queue *q) unsigned long flags; spin_lock_irqsave(q->queue_lock, flags); - if (blk_queue_stopped(q)) { - spin_unlock_irqrestore(q->queue_lock, flags); - return; - } - - blk_stop_queue(q); + if (!blk_queue_stopped(q)) + blk_stop_queue(q); spin_unlock_irqrestore(q->queue_lock, flags); } -- 2.9.3 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 7/9] dm: Mark block layer queue dead before destroying the dm device
From: Bart Van Assche @ 2016-08-31 22:17 UTC
To: Mike Snitzer; +Cc: device-mapper development

This prevents new requests from being queued while __dm_destroy() is in
progress.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 drivers/md/dm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 0775411..4aa7030 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1875,6 +1875,7 @@ EXPORT_SYMBOL_GPL(dm_device_name);
 
 static void __dm_destroy(struct mapped_device *md, bool wait)
 {
+	struct request_queue *q = dm_get_md_queue(md);
 	struct dm_table *map;
 	int srcu_idx;
 
@@ -1885,6 +1886,10 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
 	set_bit(DMF_FREEING, &md->flags);
 	spin_unlock(&_minor_lock);
 
+	spin_lock_irq(q->queue_lock);
+	queue_flag_set(QUEUE_FLAG_DYING, q);
+	spin_unlock_irq(q->queue_lock);
+
 	if (dm_request_based(md) && md->kworker_task)
 		flush_kthread_worker(&md->kworker);
 
-- 
2.9.3
* [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
From: Bart Van Assche @ 2016-08-31 22:18 UTC
To: Mike Snitzer; +Cc: device-mapper development

Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
calls have stopped before setting the "queue stopped" flag. This makes
it possible to remove the "queue stopped" test from dm_mq_queue_rq() and
dm_mq_requeue_request(). Use BLK_MQ_S_STOPPED instead of
QUEUE_FLAG_STOPPED.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 drivers/md/dm-rq.c | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 8dc8cfb..b5db523 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -78,7 +78,6 @@ void dm_start_queue(struct request_queue *q)
 	if (!q->mq_ops)
 		dm_old_start_queue(q);
 	else {
-		queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, q);
 		blk_mq_start_stopped_hw_queues(q, true);
 		blk_mq_kick_requeue_list(q);
 	}
@@ -98,13 +97,13 @@ void dm_stop_queue(struct request_queue *q)
 {
 	if (!q->mq_ops)
 		dm_old_stop_queue(q);
-	else {
-		spin_lock_irq(q->queue_lock);
-		queue_flag_set(QUEUE_FLAG_STOPPED, q);
-		spin_unlock_irq(q->queue_lock);
-
+	else if (!blk_mq_queue_stopped(q)) {
+		/* Wait until dm_mq_queue_rq() has finished. */
+		blk_mq_freeze_queue(q);
+		/* Avoid that requeuing could restart the queue. */
 		blk_mq_cancel_requeue_work(q);
 		blk_mq_stop_hw_queues(q);
+		blk_mq_unfreeze_queue(q);
 	}
 }
 
@@ -318,13 +317,10 @@ static void dm_old_requeue_request(struct request *rq)
 static void dm_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned long flags;
 
 	blk_mq_requeue_request(rq);
-	spin_lock_irqsave(q->queue_lock, flags);
-	if (!blk_queue_stopped(q))
-		blk_mq_kick_requeue_list(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	WARN_ON_ONCE(blk_mq_queue_stopped(q));
+	blk_mq_kick_requeue_list(q);
 }
 
 static void dm_requeue_original_request(struct mapped_device *md,
@@ -867,17 +863,6 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		dm_put_live_table(md, srcu_idx);
 	}
 
-	/*
-	 * On suspend dm_stop_queue() handles stopping the blk-mq
-	 * request_queue BUT: even though the hw_queues are marked
-	 * BLK_MQ_S_STOPPED at that point there is still a race that
-	 * is allowing block/blk-mq.c to call ->queue_rq against a
-	 * hctx that it really shouldn't. The following check guards
-	 * against this rarity (albeit _not_ race-free).
-	 */
-	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
-		return BLK_MQ_RQ_QUEUE_BUSY;
-
 	if (ti->type->busy && ti->type->busy(ti))
 		return BLK_MQ_RQ_QUEUE_BUSY;
 
-- 
2.9.3
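The ordering in the reworked dm_stop_queue() is the crux of the fix:
freezing the queue drains any in-flight ->queue_rq() calls before the
queues are marked stopped, so the hot path no longer needs its own
stopped-queue check. A toy model of that sequence, using plain counters
and flags in place of real blk-mq state (all names below are illustrative,
not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

struct mock_mq {
	int  in_flight_queue_rq;	/* concurrent ->queue_rq() calls */
	bool frozen;
	bool requeue_work_pending;
	bool hw_queues_stopped;
};

/* Models blk_mq_freeze_queue(): block new entries, wait for old ones. */
static void mock_freeze(struct mock_mq *q)
{
	q->frozen = true;
	q->in_flight_queue_rq = 0;	/* "wait" until all callers finished */
}

/* Models the fixed dm_stop_queue() ordering from this patch. */
static void mock_dm_stop_queue(struct mock_mq *q)
{
	if (q->hw_queues_stopped)
		return;				/* blk_mq_queue_stopped() check */
	mock_freeze(q);				/* wait for queue_rq to finish  */
	q->requeue_work_pending = false;	/* blk_mq_cancel_requeue_work() */
	q->hw_queues_stopped = true;		/* blk_mq_stop_hw_queues()      */
	q->frozen = false;			/* blk_mq_unfreeze_queue()      */
}
```

The invariant the model captures is that by the time the stopped state is
set, no ->queue_rq() call is still running and no requeue work can fire
and restart the queue.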
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-08-31 22:18 ` [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues Bart Van Assche @ 2016-09-01 3:13 ` Mike Snitzer 2016-09-01 14:23 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 3:13 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe, device-mapper development, hch On Wed, Aug 31 2016 at 6:18pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request() > calls have stopped before setting the "queue stopped" flag. This > allows to remove the "queue stopped" test from dm_mq_queue_rq() and > dm_mq_requeue_request(). Use BLK_MQ_S_STOPPED instead of > QUEUE_FLAG_STOPPED. > > Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> At first glance, at a minimum this patch needs a better header. It seems you're doing 2 things: 1) using blk_mq_{freeze,unfreeze}_queue() actually makes dm_stop_queue() work for blk-mq? Whereby fixing blk-mq race(s)? 
2) switching away from QUEUE_FLAG_STOPPED to BLK_MQ_S_STOPPED (via blk_mq_queue_stopped) - not clear to me that dm-mq's use of QUEUE_FLAG_STOPPED wasn't fine; NVMe also uses it for blk-mq > --- > drivers/md/dm-rq.c | 29 +++++++---------------------- > 1 file changed, 7 insertions(+), 22 deletions(-) > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c > index 8dc8cfb..b5db523 100644 > --- a/drivers/md/dm-rq.c > +++ b/drivers/md/dm-rq.c > @@ -78,7 +78,6 @@ void dm_start_queue(struct request_queue *q) > if (!q->mq_ops) > dm_old_start_queue(q); > else { > - queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, q); > blk_mq_start_stopped_hw_queues(q, true); > blk_mq_kick_requeue_list(q); > } > @@ -98,13 +97,13 @@ void dm_stop_queue(struct request_queue *q) > { > if (!q->mq_ops) > dm_old_stop_queue(q); > - else { > - spin_lock_irq(q->queue_lock); > - queue_flag_set(QUEUE_FLAG_STOPPED, q); > - spin_unlock_irq(q->queue_lock); > - > + else if (!blk_mq_queue_stopped(q)) { > + /* Wait until dm_mq_queue_rq() has finished. */ > + blk_mq_freeze_queue(q); > + /* Avoid that requeuing could restart the queue. 
*/ > blk_mq_cancel_requeue_work(q); > blk_mq_stop_hw_queues(q); > + blk_mq_unfreeze_queue(q); > } > } > > @@ -318,13 +317,10 @@ static void dm_old_requeue_request(struct request *rq) > static void dm_mq_requeue_request(struct request *rq) > { > struct request_queue *q = rq->q; > - unsigned long flags; > > blk_mq_requeue_request(rq); > - spin_lock_irqsave(q->queue_lock, flags); > - if (!blk_queue_stopped(q)) > - blk_mq_kick_requeue_list(q); > - spin_unlock_irqrestore(q->queue_lock, flags); > + WARN_ON_ONCE(blk_mq_queue_stopped(q)); > + blk_mq_kick_requeue_list(q); > } > > static void dm_requeue_original_request(struct mapped_device *md, > @@ -867,17 +863,6 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx, > dm_put_live_table(md, srcu_idx); > } > > - /* > - * On suspend dm_stop_queue() handles stopping the blk-mq > - * request_queue BUT: even though the hw_queues are marked > - * BLK_MQ_S_STOPPED at that point there is still a race that > - * is allowing block/blk-mq.c to call ->queue_rq against a > - * hctx that it really shouldn't. The following check guards > - * against this rarity (albeit _not_ race-free). > - */ > - if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))) > - return BLK_MQ_RQ_QUEUE_BUSY; > - > if (ti->type->busy && ti->type->busy(ti)) > return BLK_MQ_RQ_QUEUE_BUSY; > > -- > 2.9.3 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 3:13 ` Mike Snitzer @ 2016-09-01 14:23 ` Bart Van Assche 2016-09-01 15:05 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 14:23 UTC (permalink / raw) To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On 08/31/16 20:14, Mike Snitzer wrote: > On Wed, Aug 31 2016 at 6:18pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >> Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request() >> calls have stopped before setting the "queue stopped" flag. This >> allows to remove the "queue stopped" test from dm_mq_queue_rq() and >> dm_mq_requeue_request(). Use BLK_MQ_S_STOPPED instead of >> QUEUE_FLAG_STOPPED. > > At first glance, at a minimum this patch needs a better header. It > seems you're doing 2 things: > > 1) using blk_mq_{freeze,unfreeze}_queue() actually makes dm_stop_queue() > work for blk-mq? Whereby fixing blk-mq race(s)? > > 2) switching away from QUEUE_FLAG_STOPPED to BLK_MQ_S_STOPPED (via > blk_mq_queue_stopped) > - not clear to me that dm-mq's use of QUEUE_FLAG_STOPPED wasn't fine; > NVMe also uses it for blk-mq Hello Mike, Adding the blk_mq_{freeze,unfreeze}_queue() calls is indeed what fixes the race conditions related to stopping dm queues and what makes dm_stop_queue() work. If other blk-mq users and developers agree that QUEUE_FLAG_STOPPED should be set for stopped blk-mq queues then I think the code to set that flag should be moved into the blk-mq core. However, setting that flag for blk-mq queues seems redundant to me. Hence my proposal to introduce blk_mq_queue_stopped() instead. Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 14:23 ` Bart Van Assche @ 2016-09-01 15:05 ` Mike Snitzer 2016-09-01 15:31 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 15:05 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 10:23am -0400, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote: > On 08/31/16 20:14, Mike Snitzer wrote: > > On Wed, Aug 31 2016 at 6:18pm -0400, > > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > > > >> Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request() > >> calls have stopped before setting the "queue stopped" flag. This > >> allows to remove the "queue stopped" test from dm_mq_queue_rq() and > >> dm_mq_requeue_request(). Use BLK_MQ_S_STOPPED instead of > >> QUEUE_FLAG_STOPPED. > > > > At first glance, at a minimum this patch needs a better header. It > > seems you're doing 2 things: > > > > 1) using blk_mq_{freeze,unfreeze}_queue() actually makes dm_stop_queue() > > work for blk-mq? Whereby fixing blk-mq race(s)? > > > > 2) switching away from QUEUE_FLAG_STOPPED to BLK_MQ_S_STOPPED (via > > blk_mq_queue_stopped) > > - not clear to me that dm-mq's use of QUEUE_FLAG_STOPPED wasn't fine; > > NVMe also uses it for blk-mq > > Hello Mike, > > Adding the blk_mq_{freeze,unfreeze}_queue() calls is indeed what fixes > the race conditions related to stopping dm queues and what makes > dm_stop_queue() work. OK, thanks for confirming. > If other blk-mq users and developers agree that QUEUE_FLAG_STOPPED > should be set for stopped blk-mq queues then I think the code to set > that flag should be moved into the blk-mq core. However, setting that > flag for blk-mq queues seems redundant to me. Hence my proposal to > introduce blk_mq_queue_stopped() instead. This is a secondary issue that can be dealt with independent of the rest of your DM patchset. 
I've staged most of your changes (with slight tweaks), see: https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 Only remaining issue is the queue dying race(s) in dm-multipath. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
From: Bart Van Assche @ 2016-09-01 15:31 UTC
To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On 09/01/2016 08:05 AM, Mike Snitzer wrote:
> I've staged most of your changes (with slight tweaks), see:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9
>
> Only remaining issue is the queue dying race(s) in dm-multipath.

Thanks Mike! Two minor comments though:
* In dm_start_queue(), I think that the queue_flag_clear_unlocked() call
  should be converted into queue_flag_clear() and that it should be
  protected by the block layer queue lock. Every call of
  queue_flag_clear_unlocked() after block device initialization has
  finished is wrong if blk_cleanup_queue() can be called concurrently.
* I think that adding blk_mq_{freeze,unfreeze}_queue() calls in
  dm_stop_queue() not only makes it possible to remove the "queue
  stopped" test from dm_mq_queue_rq() but also to remove that test from
  dm_mq_requeue_request().

Bart.
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 15:31 ` Bart Van Assche @ 2016-09-01 15:50 ` Mike Snitzer 2016-09-01 16:12 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 15:50 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 11:31am -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > On 09/01/2016 08:05 AM, Mike Snitzer wrote: > >I've staged most of your changes (with slight tweaks), see: > >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 > > > >Only remaining issue is the queue dying race(s) in dm-multipath. > > Thanks Mike! Two minor comments though: > * In dm_start_queue(), I think that the queue_flag_clear_unlocked() > call should be converted into queue_flag_clear() and that it should > be protected by the block layer queue lock. Every call of > queue_flag_clear_unlocked() after block device initialization has > finished is wrong if blk_cleanup_queue() can be called concurrently. OK, I'll have a look. > * I think that adding blk_mq_{freeze,unfreeze}_queue() calls in > dm_stop_queue() not only allows to remove the "queue stopped" test > from dm_mq_queue_rq() but also that it allows to remove that test > from dm_mq_requeue_request(). I'm aware you think that but I need to circle back to dm_mq_requeue_request() vs blk_queue_stopped(). The code as is isn't a problem, just might be an extra check that isn't needed any more. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-01 15:50 ` Mike Snitzer
@ 2016-09-01 16:12   ` Mike Snitzer
  2016-09-01 17:59     ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Snitzer @ 2016-09-01 16:12 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On Thu, Sep 01 2016 at 11:50am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Sep 01 2016 at 11:31am -0400,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>
> > On 09/01/2016 08:05 AM, Mike Snitzer wrote:
> > > I've staged most of your changes (with slight tweaks), see:
> > > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9
> > >
> > > Only remaining issue is the queue dying race(s) in dm-multipath.
> >
> > Thanks Mike! Two minor comments though:
> > * In dm_start_queue(), I think that the queue_flag_clear_unlocked()
> >   call should be converted into queue_flag_clear() and that it should
> >   be protected by the block layer queue lock. Every call of
> >   queue_flag_clear_unlocked() after block device initialization has
> >   finished is wrong if blk_cleanup_queue() can be called concurrently.
>
> OK, I'll have a look.

Please see/test the dm-4.8 and dm-4.9 branches (dm-4.9 being rebased
on top of dm-4.8):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 16:12 ` Mike Snitzer @ 2016-09-01 17:59 ` Bart Van Assche 2016-09-01 19:05 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 17:59 UTC (permalink / raw) To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On 09/01/2016 09:12 AM, Mike Snitzer wrote: > On Thu, Sep 01 2016 at 11:50am -0400, > Mike Snitzer <snitzer@redhat.com> wrote: > >> On Thu, Sep 01 2016 at 11:31am -0400, >> Bart Van Assche <bart.vanassche@sandisk.com> wrote: >> >>> On 09/01/2016 08:05 AM, Mike Snitzer wrote: >>>> I've staged most of your changes (with slight tweaks), see: >>>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 >>>> >>>> Only remaining issue is the queue dying race(s) in dm-multipath. >>> >>> Thanks Mike! Two minor comments though: >>> * In dm_start_queue(), I think that the queue_flag_clear_unlocked() >>> call should be converted into queue_flag_clear() and that it should >>> be protected by the block layer queue lock. Every call of >>> queue_flag_clear_unlocked() after block device initialization has >>> finished is wrong if blk_cleanup_queue() can be called concurrently. >> >> OK, I'll have a look. > > Please see/test the dm-4.8 and dm-4.9 branches (dm-4.9 being rebased > ontop of dm-4.8): > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8 > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 Hello Mike, The result of my tests of the dm-4.9 branch is as follows: * With patch "dm mpath: check if path's request_queue is dying in activate_path()" I still see every now and then that CPU usage of one of the kworker threads jumps to 100%. 
* An "if (!blk_queue_stopped(q))" test needs to be added in
  dm_stop_queue() to avoid the following hang (that test was present
  in my version of the patch that adds the
  blk_mq_{freeze,unfreeze}_queue() calls):

sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
multipathd      D ffff8803c8d37b80     0  3242      1 0x00000000
Call Trace:
 [<ffffffff81627087>] schedule+0x37/0x90
 [<ffffffff813097e1>] blk_mq_freeze_queue_wait+0x51/0xb0
 [<ffffffff8130be05>] blk_mq_freeze_queue+0x15/0x20
 [<ffffffffa034d882>] dm_stop_queue+0x62/0xc0 [dm_mod]
 [<ffffffffa0342a1b>] dm_swap_table+0x2fb/0x370 [dm_mod]
 [<ffffffffa0347875>] dev_suspend+0x95/0x220 [dm_mod]
 [<ffffffffa03480fc>] ctl_ioctl+0x1fc/0x550 [dm_mod]
 [<ffffffffa034845e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
 [<ffffffff811ee27f>] do_vfs_ioctl+0x8f/0x690
 [<ffffffff811ee8bc>] SyS_ioctl+0x3c/0x70
 [<ffffffff8162d125>] entry_SYSCALL_64_fastpath+0x18/0xa8

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 17:59 ` Bart Van Assche @ 2016-09-01 19:05 ` Mike Snitzer 2016-09-01 19:35 ` Mike Snitzer 2016-09-01 20:15 ` Bart Van Assche 0 siblings, 2 replies; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 19:05 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 1:59pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > On 09/01/2016 09:12 AM, Mike Snitzer wrote: > >On Thu, Sep 01 2016 at 11:50am -0400, > >Mike Snitzer <snitzer@redhat.com> wrote: > > > >>On Thu, Sep 01 2016 at 11:31am -0400, > >>Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >> > >>>On 09/01/2016 08:05 AM, Mike Snitzer wrote: > >>>>I've staged most of your changes (with slight tweaks), see: > >>>>https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 > >>>> > >>>>Only remaining issue is the queue dying race(s) in dm-multipath. > >>> > >>>Thanks Mike! Two minor comments though: > >>>* In dm_start_queue(), I think that the queue_flag_clear_unlocked() > >>> call should be converted into queue_flag_clear() and that it should > >>> be protected by the block layer queue lock. Every call of > >>> queue_flag_clear_unlocked() after block device initialization has > >>> finished is wrong if blk_cleanup_queue() can be called concurrently. > >> > >>OK, I'll have a look. > > > >Please see/test the dm-4.8 and dm-4.9 branches (dm-4.9 being rebased > >ontop of dm-4.8): > >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8 > >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 > > Hello Mike, > > The result of my tests of the dm-4.9 branch is as follows: > * With patch "dm mpath: check if path's request_queue is dying in > activate_path()" I still see every now and then that CPU usage of > one of the kworker threads jumps to 100%. 
So you're saying that the dying queue check is still needed in the path
selector? It would be useful to know why the 100% CPU usage is
occurring. Can you get a stack trace during this time?

> * A "if (!blk_queue_stopped(q))" test needs to be added in
>   dm_stop_queue() to avoid the following hang (that test was present
>   in my version of the patch that adds the
>   blk_mq_{freeze,unfreeze}_queue() calls):
>
> sysrq: SysRq : Show Blocked State
>   task                        PC stack   pid father
> multipathd      D ffff8803c8d37b80     0  3242      1 0x00000000
> Call Trace:
>  [<ffffffff81627087>] schedule+0x37/0x90
>  [<ffffffff813097e1>] blk_mq_freeze_queue_wait+0x51/0xb0
>  [<ffffffff8130be05>] blk_mq_freeze_queue+0x15/0x20
>  [<ffffffffa034d882>] dm_stop_queue+0x62/0xc0 [dm_mod]
>  [<ffffffffa0342a1b>] dm_swap_table+0x2fb/0x370 [dm_mod]
>  [<ffffffffa0347875>] dev_suspend+0x95/0x220 [dm_mod]
>  [<ffffffffa03480fc>] ctl_ioctl+0x1fc/0x550 [dm_mod]
>  [<ffffffffa034845e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
>  [<ffffffff811ee27f>] do_vfs_ioctl+0x8f/0x690
>  [<ffffffff811ee8bc>] SyS_ioctl+0x3c/0x70
>  [<ffffffff8162d125>] entry_SYSCALL_64_fastpath+0x18/0xa8

OK, I've adjusted accordingly and pushed dm-4.8 and dm-4.9 again (with
force, sorry about that).

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 19:05 ` Mike Snitzer @ 2016-09-01 19:35 ` Mike Snitzer 2016-09-01 20:15 ` Bart Van Assche 1 sibling, 0 replies; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 19:35 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 3:05pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Thu, Sep 01 2016 at 1:59pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > > > * A "if (!blk_queue_stopped(q))" test needs to be added in > > dm_stop_queue() to avoid the following hang (that test was present > > in my version of the patch that adds the > > blk_mq_{freeze,unfreeze}_queue() calls): > > > > sysrq: SysRq : Show Blocked State > > task PC stack pid father > > multipathd D ffff8803c8d37b80 0 3242 1 0x00000000 > > Call Trace: > > [<ffffffff81627087>] schedule+0x37/0x90 > > [<ffffffff813097e1>] blk_mq_freeze_queue_wait+0x51/0xb0 > > [<ffffffff8130be05>] blk_mq_freeze_queue+0x15/0x20 > > [<ffffffffa034d882>] dm_stop_queue+0x62/0xc0 [dm_mod] > > [<ffffffffa0342a1b>] dm_swap_table+0x2fb/0x370 [dm_mod] > > [<ffffffffa0347875>] dev_suspend+0x95/0x220 [dm_mod] > > [<ffffffffa03480fc>] ctl_ioctl+0x1fc/0x550 [dm_mod] > > [<ffffffffa034845e>] dm_ctl_ioctl+0xe/0x20 [dm_mod] > > [<ffffffff811ee27f>] do_vfs_ioctl+0x8f/0x690 > > [<ffffffff811ee8bc>] SyS_ioctl+0x3c/0x70 > > [<ffffffff8162d125>] entry_SYSCALL_64_fastpath+0x18/0xa8 > > OK, I've adjusted accordingly and pushed dm-4.8 and dm-4.9 again (with > force, sorry about that). I'm testing the result.. needs more time. I'll let you know once my mptest-based testing passes. (I think I'm holding the q->lock too long in dm_mq_stop_queue()). ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 19:05 ` Mike Snitzer 2016-09-01 19:35 ` Mike Snitzer @ 2016-09-01 20:15 ` Bart Van Assche 2016-09-01 20:33 ` Mike Snitzer 1 sibling, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 20:15 UTC (permalink / raw) To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On 09/01/2016 12:05 PM, Mike Snitzer wrote: > On Thu, Sep 01 2016 at 1:59pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: >> On 09/01/2016 09:12 AM, Mike Snitzer wrote: >>> Please see/test the dm-4.8 and dm-4.9 branches (dm-4.9 being rebased >>> ontop of dm-4.8): >>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8 >>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 >> >> Hello Mike, >> >> The result of my tests of the dm-4.9 branch is as follows: >> * With patch "dm mpath: check if path's request_queue is dying in >> activate_path()" I still see every now and then that CPU usage of >> one of the kworker threads jumps to 100%. > > So you're saying that the dying queue check is still needed in the path > selector? Would be useful to know why the 100% is occuring. Can you > get a stack trace during this time? Hello Mike, A few days ago I had already tried to obtain a stack trace with perf but the information reported by perf wasn't entirely accurate. What I know about that 100% CPU usage is as follows: * "dmsetup table" showed three SRP SCSI device nodes but these SRP SCSI device nodes were not visible in /sys/block. This means that scsi_remove_host() had already removed these from sysfs. * hctx->run_work kept being requeued over and over again on the kernel thread with name "kworker/3:1H". I assume this means that blk_mq_run_hw_queue() was called with the second argument (async) set to true. 
This probably means that the following dm-rq code was triggered:

	if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE) {
		/* Undo dm_start_request() before requeuing */
		rq_end_stats(md, rq);
		rq_completed(md, rq_data_dir(rq), false);
		return BLK_MQ_RQ_QUEUE_BUSY;
	}

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 20:15 ` Bart Van Assche @ 2016-09-01 20:33 ` Mike Snitzer 2016-09-01 20:39 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 20:33 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 4:15pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > On 09/01/2016 12:05 PM, Mike Snitzer wrote: > >On Thu, Sep 01 2016 at 1:59pm -0400, > >Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >>On 09/01/2016 09:12 AM, Mike Snitzer wrote: > >>>Please see/test the dm-4.8 and dm-4.9 branches (dm-4.9 being rebased > >>>ontop of dm-4.8): > >>>https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8 > >>>https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9 > >> > >>Hello Mike, > >> > >>The result of my tests of the dm-4.9 branch is as follows: > >>* With patch "dm mpath: check if path's request_queue is dying in > >>activate_path()" I still see every now and then that CPU usage of > >>one of the kworker threads jumps to 100%. > > > >So you're saying that the dying queue check is still needed in the path > >selector? Would be useful to know why the 100% is occuring. Can you > >get a stack trace during this time? > > Hello Mike, > > A few days ago I had already tried to obtain a stack trace with perf > but the information reported by perf wasn't entirely accurate. What > I know about that 100% CPU usage is as follows: > * "dmsetup table" showed three SRP SCSI device nodes but these SRP SCSI > device nodes were not visible in /sys/block. This means that > scsi_remove_host() had already removed these from sysfs. > * hctx->run_work kept being requeued over and over again on the kernel > thread with name "kworker/3:1H". 
> I assume this means that
> blk_mq_run_hw_queue() was called with the second argument (async) set
> to true. This probably means that the following dm-rq code was
> triggered:
>
> 	if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE) {
> 		/* Undo dm_start_request() before requeuing */
> 		rq_end_stats(md, rq);
> 		rq_completed(md, rq_data_dir(rq), false);
> 		return BLK_MQ_RQ_QUEUE_BUSY;
> 	}

I'm able to easily reproduce this 100% cpu usage using mptest's
test_02_sdev_delete.

'dmsetup suspend --nolockfs --noflush mp' hangs, seems rooted in your
use of blk_mq_freeze_queue():

[  298.136930] dmsetup         D ffff880142cb3b70     0  9478   9414 0x00000080
[  298.144831]  ffff880142cb3b70 ffff880142cb3b28 ffff880330d6cb00 ffff88032d0022f8
[  298.153132]  ffff880142cb4000 ffff88032d0022f8 ffff88032b161800 0000000000000001
[  298.161438]  0000000000000001 ffff880142cb3b88 ffffffff816c06e5 ffff88032d001aa0
[  298.169740] Call Trace:
[  298.172473]  [<ffffffff816c06e5>] schedule+0x35/0x80
[  298.178019]  [<ffffffff8131b937>] blk_mq_freeze_queue_wait+0x57/0xc0
[  298.185116]  [<ffffffff810c58c0>] ? prepare_to_wait_event+0xf0/0xf0
[  298.192117]  [<ffffffff8131d92a>] blk_mq_freeze_queue+0x1a/0x20
[  298.198734]  [<ffffffffa000e910>] dm_stop_queue+0x50/0xc0 [dm_mod]
[  298.205644]  [<ffffffffa0001824>] __dm_suspend+0x134/0x1f0 [dm_mod]
[  298.212649]  [<ffffffffa00035b8>] dm_suspend+0xb8/0xd0 [dm_mod]
[  298.219270]  [<ffffffffa000882e>] dev_suspend+0x18e/0x240 [dm_mod]
[  298.226175]  [<ffffffffa00086a0>] ? table_load+0x380/0x380 [dm_mod]
[  298.233180]  [<ffffffffa0009027>] ctl_ioctl+0x1e7/0x4d0 [dm_mod]
[  298.239890]  [<ffffffff81197f00>] ? lru_cache_add_active_or_unevictable+0x10/0xb0
[  298.248253]  [<ffffffffa0009323>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[  298.255049]  [<ffffffff81227937>] do_vfs_ioctl+0xa7/0x5d0
[  298.261081]  [<ffffffff8112787f>] ? __audit_syscall_entry+0xaf/0x100
[  298.268178]  [<ffffffff8100365d>] ? syscall_trace_enter+0x1dd/0x2c0
[  298.275179]  [<ffffffff81227ed9>] SyS_ioctl+0x79/0x90
[  298.280821]  [<ffffffff81003a47>] do_syscall_64+0x67/0x160
[  298.286950]  [<ffffffff816c4921>] entry_SYSCALL64_slow_path+0x25/0x25

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-01 20:33 ` Mike Snitzer
@ 2016-09-01 20:39   ` Bart Van Assche
  2016-09-01 20:48     ` Mike Snitzer
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2016-09-01 20:39 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On 09/01/2016 01:33 PM, Mike Snitzer wrote:
> I'm able to easily reproduce this 100% cpu usage using mptest's
> test_02_sdev_delete.
>
> 'dmsetup suspend --nolockfs --noflush mp' hangs, seems rooted in your
> use of blk_mq_freeze_queue():
>
> [  298.136930] dmsetup         D ffff880142cb3b70     0  9478   9414 0x00000080
> [  298.144831]  ffff880142cb3b70 ffff880142cb3b28 ffff880330d6cb00 ffff88032d0022f8
> [  298.153132]  ffff880142cb4000 ffff88032d0022f8 ffff88032b161800 0000000000000001
> [  298.161438]  0000000000000001 ffff880142cb3b88 ffffffff816c06e5 ffff88032d001aa0
> [  298.169740] Call Trace:
> [  298.172473]  [<ffffffff816c06e5>] schedule+0x35/0x80
> [  298.178019]  [<ffffffff8131b937>] blk_mq_freeze_queue_wait+0x57/0xc0
> [  298.185116]  [<ffffffff810c58c0>] ? prepare_to_wait_event+0xf0/0xf0
> [  298.192117]  [<ffffffff8131d92a>] blk_mq_freeze_queue+0x1a/0x20
> [  298.198734]  [<ffffffffa000e910>] dm_stop_queue+0x50/0xc0 [dm_mod]
> [  298.205644]  [<ffffffffa0001824>] __dm_suspend+0x134/0x1f0 [dm_mod]
> [  298.212649]  [<ffffffffa00035b8>] dm_suspend+0xb8/0xd0 [dm_mod]
> [  298.219270]  [<ffffffffa000882e>] dev_suspend+0x18e/0x240 [dm_mod]
> [  298.226175]  [<ffffffffa00086a0>] ? table_load+0x380/0x380 [dm_mod]
> [  298.233180]  [<ffffffffa0009027>] ctl_ioctl+0x1e7/0x4d0 [dm_mod]
> [  298.239890]  [<ffffffff81197f00>] ? lru_cache_add_active_or_unevictable+0x10/0xb0
> [  298.248253]  [<ffffffffa0009323>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
> [  298.255049]  [<ffffffff81227937>] do_vfs_ioctl+0xa7/0x5d0
> [  298.261081]  [<ffffffff8112787f>] ? __audit_syscall_entry+0xaf/0x100
> [  298.268178]  [<ffffffff8100365d>] ? syscall_trace_enter+0x1dd/0x2c0
> [  298.275179]  [<ffffffff81227ed9>] SyS_ioctl+0x79/0x90
> [  298.280821]  [<ffffffff81003a47>] do_syscall_64+0x67/0x160
> [  298.286950]  [<ffffffff816c4921>] entry_SYSCALL64_slow_path+0x25/0x25
>

Hello Mike,

In your dm-4.9 branch I see that you call blk_mq_freeze_queue() while
holding the block layer queue lock. Please don't do this.
blk_mq_freeze_queue() can sleep and as you know calling a sleeping
function while holding a spinlock is not allowed.

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-01 20:39 ` Bart Van Assche
@ 2016-09-01 20:48   ` Mike Snitzer
  2016-09-01 20:52     ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Snitzer @ 2016-09-01 20:48 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On Thu, Sep 01 2016 at 4:39pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 09/01/2016 01:33 PM, Mike Snitzer wrote:
> > I'm able to easily reproduce this 100% cpu usage using mptest's
> > test_02_sdev_delete.
> >
> > 'dmsetup suspend --nolockfs --noflush mp' hangs, seems rooted in your
> > use of blk_mq_freeze_queue():
> >
> > [  298.136930] dmsetup         D ffff880142cb3b70     0  9478   9414 0x00000080
> > [  298.144831]  ffff880142cb3b70 ffff880142cb3b28 ffff880330d6cb00 ffff88032d0022f8
> > [  298.153132]  ffff880142cb4000 ffff88032d0022f8 ffff88032b161800 0000000000000001
> > [  298.161438]  0000000000000001 ffff880142cb3b88 ffffffff816c06e5 ffff88032d001aa0
> > [  298.169740] Call Trace:
> > [  298.172473]  [<ffffffff816c06e5>] schedule+0x35/0x80
> > [  298.178019]  [<ffffffff8131b937>] blk_mq_freeze_queue_wait+0x57/0xc0
> > [  298.185116]  [<ffffffff810c58c0>] ? prepare_to_wait_event+0xf0/0xf0
> > [  298.192117]  [<ffffffff8131d92a>] blk_mq_freeze_queue+0x1a/0x20
> > [  298.198734]  [<ffffffffa000e910>] dm_stop_queue+0x50/0xc0 [dm_mod]
> > [  298.205644]  [<ffffffffa0001824>] __dm_suspend+0x134/0x1f0 [dm_mod]
> > [  298.212649]  [<ffffffffa00035b8>] dm_suspend+0xb8/0xd0 [dm_mod]
> > [  298.219270]  [<ffffffffa000882e>] dev_suspend+0x18e/0x240 [dm_mod]
> > [  298.226175]  [<ffffffffa00086a0>] ? table_load+0x380/0x380 [dm_mod]
> > [  298.233180]  [<ffffffffa0009027>] ctl_ioctl+0x1e7/0x4d0 [dm_mod]
> > [  298.239890]  [<ffffffff81197f00>] ? lru_cache_add_active_or_unevictable+0x10/0xb0
> > [  298.248253]  [<ffffffffa0009323>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
> > [  298.255049]  [<ffffffff81227937>] do_vfs_ioctl+0xa7/0x5d0
> > [  298.261081]  [<ffffffff8112787f>] ? __audit_syscall_entry+0xaf/0x100
> > [  298.268178]  [<ffffffff8100365d>] ? syscall_trace_enter+0x1dd/0x2c0
> > [  298.275179]  [<ffffffff81227ed9>] SyS_ioctl+0x79/0x90
> > [  298.280821]  [<ffffffff81003a47>] do_syscall_64+0x67/0x160
> > [  298.286950]  [<ffffffff816c4921>] entry_SYSCALL64_slow_path+0x25/0x25
> >
>
> Hello Mike,
>
> In your dm-4.9 branch I see that you call blk_mq_freeze_queue() while
> holding the block layer queue lock. Please don't do this.
> blk_mq_freeze_queue() can sleep and as you know calling a sleeping
> function while holding a spinlock is not allowed.

Yeah, I since fixed that. Doesn't change the fact that your use of
blk_mq_freeze_queue() causes the 100% cpu usage.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 20:48 ` Mike Snitzer @ 2016-09-01 20:52 ` Bart Van Assche 2016-09-01 21:17 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 20:52 UTC (permalink / raw) To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On 09/01/2016 01:48 PM, Mike Snitzer wrote: > Yeah, I since fixed that. Doesn't change the fact that your use of > blk_mq_freeze_queue() causes the 100% cpu usage. Hello Mike, Sorry but that doesn't make sense to me. blk_mq_freeze_queue() either returns quickly or waits. It cannot cause 100% CPU usage. Something else must be going on. If you can make the code available that you used in your test I will have a look at it. Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-01 20:52 ` Bart Van Assche
@ 2016-09-01 21:17   ` Mike Snitzer
  2016-09-01 22:18     ` Mike Snitzer
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Snitzer @ 2016-09-01 21:17 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On Thu, Sep 01 2016 at 4:52pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 09/01/2016 01:48 PM, Mike Snitzer wrote:
> > Yeah, I since fixed that. Doesn't change the fact that your use of
> > blk_mq_freeze_queue() causes the 100% cpu usage.
>
> Hello Mike,
>
> Sorry but that doesn't make sense to me. blk_mq_freeze_queue()
> either returns quickly or waits. It cannot cause 100% CPU usage.
> Something else must be going on. If you can make the code available
> that you used in your test I will have a look at it.

The suspend is hanging due to the blk_mq_freeze_queue() -- and in turn
the resume (after reinstating paths) never happens. So no paths are
available. IO just keeps getting requeued, hence the 100% usage.

The root of the problem is that the suspend isn't completing though.

I've moved your work out to my private devel branch, see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 21:17 ` Mike Snitzer @ 2016-09-01 22:18 ` Mike Snitzer 2016-09-01 22:22 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 22:18 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 5:17pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Thu, Sep 01 2016 at 4:52pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > > > On 09/01/2016 01:48 PM, Mike Snitzer wrote: > > >Yeah, I since fixed that. Doesn't change the fact that your use of > > >blk_mq_freeze_queue() causes the 100% cpu usage. > > > > Hello Mike, > > > > Sorry but that doesn't make sense to me. blk_mq_freeze_queue() > > either returns quickly or waits. It cannot cause 100% CPU usage. > > Something else must be going on. If you can make the code available > > that you used in your test I will have a look at it. > > The suspend is hanging due the blk_mq_freeze_queue() -- and in turn the > resume (after reinstating paths) never happens. So no paths are > available. IO just keeps getting requeued, hence the 100% usage. > > The root of the problem is that the suspend isn't completing though. > > I've moved your work out to my private devel branch, see: > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel FYI I get the same 'dmsetup suspend --nolockfs --noflush mp' hang, running mptest's test_02_sdev_delete, when I try your unmodified patchset, see: http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel.bart ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 22:18 ` Mike Snitzer @ 2016-09-01 22:22 ` Bart Van Assche 2016-09-01 22:26 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 22:22 UTC (permalink / raw) To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On 09/01/2016 03:18 PM, Mike Snitzer wrote: > FYI I get the same 'dmsetup suspend --nolockfs --noflush mp' hang, > running mptest's test_02_sdev_delete, when I try your unmodified > patchset, see: > > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel.bart Hello Mike, Are you aware that the code on that branch is a *modified* version of my patch series? The following patch is not present on that branch: "dm path selector: Avoid that device removal triggers an infinite loop". There are also other (smaller) differences. Thanks, Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 22:22 ` Bart Van Assche @ 2016-09-01 22:26 ` Mike Snitzer 2016-09-01 23:17 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 22:26 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de On Thu, Sep 01 2016 at 6:22pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > On 09/01/2016 03:18 PM, Mike Snitzer wrote: > >FYI I get the same 'dmsetup suspend --nolockfs --noflush mp' hang, > >running mptest's test_02_sdev_delete, when I try your unmodified > >patchset, see: > > > >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel.bart > > Hello Mike, > > Are you aware that the code on that branch is a *modified* version > of my patch series? The following patch is not present on that > branch: "dm path selector: Avoid that device removal triggers an > infinite loop". There are also other (smaller) differences. No, you're obviously talking about the 'devel' branch and not the 'devel.bart' branch I pointed to. The 'devel.bart' branch is the _exact_ patchset you sent. It has the same problem as the 'devel' branch. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues 2016-09-01 22:26 ` Mike Snitzer @ 2016-09-01 23:17 ` Bart Van Assche 2016-09-01 23:47 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 23:17 UTC (permalink / raw) To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de [-- Attachment #1: Type: text/plain, Size: 1663 bytes --] On 09/01/2016 03:27 PM, Mike Snitzer wrote: > On Thu, Sep 01 2016 at 6:22pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >> On 09/01/2016 03:18 PM, Mike Snitzer wrote: >>> FYI I get the same 'dmsetup suspend --nolockfs --noflush mp' hang, >>> running mptest's test_02_sdev_delete, when I try your unmodified >>> patchset, see: >>> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel.bart >> >> Hello Mike, >> >> Are you aware that the code on that branch is a *modified* version >> of my patch series? The following patch is not present on that >> branch: "dm path selector: Avoid that device removal triggers an >> infinite loop". There are also other (smaller) differences. > > No, you're obviously talking about the 'devel' branch and not the > 'devel.bart' branch I pointed to. The 'devel.bart' branch is the > _exact_ patchset you sent. It has the same problem as the 'devel' > branch. Hello Mike, Sorry that I misread your previous e-mail. After I received your latest e-mail I rebased my tree on top of the devel.bart branch mentioned above. My tests still pass. The only two patches in my tree that are relevant and that are not in the devel.bart branch have been attached to this e-mail. Did your test involve the sd driver? If so, do the attached two patches help? If the sd driver was not involved, can you provide more information about the hang you ran into? 
The output and log messages generated by the following commands after the hang has been reproduced would be very welcome: * echo w > /proc/sysrq-trigger * (cd /sys/block && grep -a '' dm*/mq/*/{pending,cpu*/rq_list}) Thanks, Bart. [-- Attachment #2: 0001-block-dm-mpath-Introduce-request-flag-REQ_FAIL_IF_NO.patch --] [-- Type: text/x-patch, Size: 2857 bytes --] From 872971ee76d299e9c002b57b5ae6f12a4bd4ef92 Mon Sep 17 00:00:00 2001 From: Bart Van Assche <bart.vanassche@sandisk.com> Date: Thu, 30 Jun 2016 11:53:38 +0200 Subject: [PATCH 1/2] block, dm-mpath: Introduce request flag REQ_FAIL_IF_NO_PATH Let dm-mpath fail requests if queue_if_no_path is enabled and request flag REQ_FAIL_IF_NO_PATH has been set. This facility will be used in the next patch to fix a deadlock between SCSI device removal and sd event checking. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> --- drivers/md/dm-mpath.c | 6 ++++-- include/linux/blk_types.h | 4 ++++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index 8df7b12..58ce8bc 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -540,6 +540,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone, struct multipath *m = ti->private; int r = DM_MAPIO_REQUEUE; size_t nr_bytes = clone ? blk_rq_bytes(clone) : blk_rq_bytes(rq); + bool fail_if_no_path = (clone ? : rq)->cmd_flags & REQ_FAIL_IF_NO_PATH; struct pgpath *pgpath; struct block_device *bdev; struct dm_mpath_io *mpio; @@ -550,7 +551,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone, pgpath = choose_pgpath(m, nr_bytes); if (!pgpath) { - if (!must_push_back_rq(m)) + if (fail_if_no_path || !must_push_back_rq(m)) r = -EIO; /* Failed */ return r; } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) || @@ -1558,6 +1559,7 @@ static int do_end_io(struct multipath *m, struct request *clone, * clone bios for it and resubmit it later. 
*/ int r = DM_ENDIO_REQUEUE; + bool fail_if_no_path = clone->cmd_flags & REQ_FAIL_IF_NO_PATH; if (!error && !clone->errors) return 0; /* I/O complete */ @@ -1570,7 +1572,7 @@ static int do_end_io(struct multipath *m, struct request *clone, if (!atomic_read(&m->nr_valid_paths)) { if (!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) { - if (!must_push_back_rq(m)) + if (fail_if_no_path || !must_push_back_rq(m)) r = -EIO; } else { if (error == -EBADE) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 436f43f..954e687 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -182,6 +182,9 @@ enum rq_flag_bits { __REQ_PM, /* runtime pm request */ __REQ_HASHED, /* on IO scheduler merge hash */ __REQ_MQ_INFLIGHT, /* track inflight for MQ */ + + __REQ_FAIL_IF_NO_PATH, /* fail if queued on dm-mpath with no paths */ + __REQ_NR_BITS, /* stops here */ }; @@ -228,6 +231,7 @@ enum rq_flag_bits { #define REQ_PM (1ULL << __REQ_PM) #define REQ_HASHED (1ULL << __REQ_HASHED) #define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT) +#define REQ_FAIL_IF_NO_PATH (1ULL << __REQ_FAIL_IF_NO_PATH) enum req_op { REQ_OP_READ, -- 2.9.3 [-- Attachment #3: 0002-sd-Fix-a-deadlock.patch --] [-- Type: text/x-patch, Size: 8343 bytes --] From 60acb292b29fc7eb40298e8d6dcc409315154340 Mon Sep 17 00:00:00 2001 From: Bart Van Assche <bart.vanassche@sandisk.com> Date: Thu, 30 Jun 2016 11:54:06 +0200 Subject: [PATCH 2/2] sd: Fix a deadlock If sd_check_events() is called for an sd device on top of a multipath device without paths then the scsi_test_unit_ready() call from that function will block until one or more paths have been added to the multipath device. However, if scsi_remove_host() is called before a path has been added then a deadlock is triggered. Fix this deadlock by making the scsi_test_unit_ready() call from sd_check_events() fail if there are no paths. 
SysRq-w reported the following call stacks:

sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
kworker/6:1     D ffff88046346b958     0    95      2 0x00000000
Workqueue: events_freezable_power_ disk_events_workfn
Call Trace:
 [<ffffffff815acf97>] schedule+0x37/0x90
 [<ffffffff815b1487>] schedule_timeout+0x157/0x230
 [<ffffffff815ac61f>] io_schedule_timeout+0x9f/0x110
 [<ffffffff815adba9>] wait_for_completion_io_timeout+0xd9/0x110
 [<ffffffff812aa9f8>] blk_execute_rq+0xa8/0x130
 [<ffffffff81408411>] scsi_execute+0xf1/0x160
 [<ffffffff8140a484>] scsi_execute_req_flags+0x84/0xf0
 [<ffffffff8140aabd>] scsi_test_unit_ready+0x7d/0x130
 [<ffffffff81416fe6>] sd_check_events+0x116/0x1a0
 [<ffffffff812b477f>] disk_check_events+0x4f/0x130
 [<ffffffff812b4877>] disk_events_workfn+0x17/0x20
 [<ffffffff810737ed>] process_one_work+0x19d/0x490
 [<ffffffff81073b29>] worker_thread+0x49/0x490
 [<ffffffff8107a0ea>] kthread+0xea/0x100
 [<ffffffff815b26ef>] ret_from_fork+0x1f/0x40
kworker/2:16    D ffff88006d9a3af0     0  2575      2 0x00000000
Workqueue: srp_remove srp_remove_work [ib_srp]
Call Trace:
 [<ffffffff815acf97>] schedule+0x37/0x90
 [<ffffffff815ad300>] schedule_preempt_disabled+0x10/0x20
 [<ffffffff815af0b5>] mutex_lock_nested+0x145/0x360
 [<ffffffff812b61ce>] disk_block_events+0x2e/0x80
 [<ffffffff812b62e5>] del_gendisk+0x35/0x1d0
 [<ffffffff81416ae6>] sd_remove+0x56/0xc0
 [<ffffffff813e0812>] __device_release_driver+0x82/0x130
 [<ffffffff813e08e0>] device_release_driver+0x20/0x30
 [<ffffffff813dff33>] bus_remove_device+0x113/0x190
 [<ffffffff813dc5bc>] device_del+0x12c/0x230
 [<ffffffff814113a8>] __scsi_remove_device+0xf8/0x130
 [<ffffffff8140f5d4>] scsi_forget_host+0x64/0x70
 [<ffffffff814033c5>] scsi_remove_host+0x75/0x120
 [<ffffffffa05b5e5b>] srp_remove_work+0x8b/0x1f0 [ib_srp]
 [<ffffffff810737ed>] process_one_work+0x19d/0x490
 [<ffffffff81073b29>] worker_thread+0x49/0x490
 [<ffffffff8107a0ea>] kthread+0xea/0x100
 [<ffffffff815b26ef>] ret_from_fork+0x1f/0x40
multipathd      D ffff88046d91fb40     0  2604      1 0x00000000
Call Trace:
 [<ffffffff815acf97>] schedule+0x37/0x90
 [<ffffffff815ad300>] schedule_preempt_disabled+0x10/0x20
 [<ffffffff815af0b5>] mutex_lock_nested+0x145/0x360
 [<ffffffff812b61ce>] disk_block_events+0x2e/0x80
 [<ffffffff811d8953>] __blkdev_get+0x53/0x480
 [<ffffffff811d8ebb>] blkdev_get+0x13b/0x3a0
 [<ffffffff811d9256>] blkdev_open+0x56/0x70
 [<ffffffff81199a46>] do_dentry_open.isra.17+0x146/0x2d0
 [<ffffffff8119ad58>] vfs_open+0x48/0x70
 [<ffffffff811ab213>] path_openat+0x163/0xa10
 [<ffffffff811aca29>] do_filp_open+0x79/0xd0
 [<ffffffff8119b0a6>] do_sys_open+0x116/0x1f0
 [<ffffffff8119b199>] SyS_open+0x19/0x20
 [<ffffffff815b24a9>] entry_SYSCALL_64_fastpath+0x1c/0xac
systemd-udevd   D ffff880419273968     0 10074    475 0x00000004
Call Trace:
 [<ffffffff815acf97>] schedule+0x37/0x90
 [<ffffffff815b14bb>] schedule_timeout+0x18b/0x230
 [<ffffffff815ae261>] wait_for_completion+0xd1/0x110
 [<ffffffff81073036>] flush_work+0x1d6/0x2a0
 [<ffffffff8107339b>] __cancel_work_timer+0xfb/0x1b0
 [<ffffffff8107346e>] cancel_delayed_work_sync+0xe/0x10
 [<ffffffff812b621e>] disk_block_events+0x7e/0x80
 [<ffffffff811d8953>] __blkdev_get+0x53/0x480
 [<ffffffff811d8ebb>] blkdev_get+0x13b/0x3a0
 [<ffffffff811d9256>] blkdev_open+0x56/0x70
 [<ffffffff81199a46>] do_dentry_open.isra.17+0x146/0x2d0
 [<ffffffff8119ad58>] vfs_open+0x48/0x70
 [<ffffffff811ab213>] path_openat+0x163/0xa10
 [<ffffffff811aca29>] do_filp_open+0x79/0xd0
 [<ffffffff8119b0a6>] do_sys_open+0x116/0x1f0
 [<ffffffff8119b199>] SyS_open+0x19/0x20
 [<ffffffff815b24a9>] entry_SYSCALL_64_fastpath+0x1c/0xac
mount           D ffff8803b42b7be8     0 13567  13442 0x00000000
Call Trace:
 [<ffffffff815acf97>] schedule+0x37/0x90
 [<ffffffff815b14bb>] schedule_timeout+0x18b/0x230
 [<ffffffff815ac61f>] io_schedule_timeout+0x9f/0x110
 [<ffffffff815ad786>] bit_wait_io+0x16/0x60
 [<ffffffff815ad408>] __wait_on_bit+0x58/0x90
 [<ffffffff8111e30f>] wait_on_page_bit_killable+0xbf/0xd0
 [<ffffffff8111e450>] generic_file_read_iter+0x130/0x710
 [<ffffffff811d7970>] blkdev_read_iter+0x30/0x40
 [<ffffffff8119c0e9>] __vfs_read+0xb9/0x120
 [<ffffffff8119c4c0>] vfs_read+0x90/0x130
 [<ffffffff8119d884>] SyS_read+0x44/0xa0
 [<ffffffff815b24a9>] entry_SYSCALL_64_fastpath+0x1c/0xac

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 drivers/scsi/scsi_lib.c    | 17 +++++++++++++----
 drivers/scsi/sd.c          |  6 ++++--
 include/scsi/scsi_device.h |  2 ++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index dd4766a..6b172f9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2401,13 +2401,14 @@ EXPORT_SYMBOL(scsi_mode_sense);
  *	@sshdr_external: Optional pointer to struct scsi_sense_hdr for
  *		returning sense. Make sure that this is cleared before passing
  *		in.
+ *	@flags: or-ed into cmd_flags.
  *
  *	Returns zero if unsuccessful or an error if TUR failed.  For
  *	removable media, UNIT_ATTENTION sets ->changed flag.
  **/
 int
-scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries,
-		     struct scsi_sense_hdr *sshdr_external)
+scsi_test_unit_ready_flags(struct scsi_device *sdev, int timeout, int retries,
+			   struct scsi_sense_hdr *sshdr_external, u64 flags)
 {
 	char cmd[] = {
 		TEST_UNIT_READY, 0, 0, 0, 0, 0,
@@ -2422,8 +2423,8 @@ scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries,
 
 	/* try to eat the UNIT_ATTENTION if there are enough retries */
 	do {
-		result = scsi_execute_req(sdev, cmd, DMA_NONE, NULL, 0, sshdr,
-					  timeout, retries, NULL);
+		result = scsi_execute_req_flags(sdev, cmd, DMA_NONE, NULL, 0,
+				sshdr, timeout, retries, NULL, flags);
 		if (sdev->removable && scsi_sense_valid(sshdr) &&
 		    sshdr->sense_key == UNIT_ATTENTION)
 			sdev->changed = 1;
@@ -2434,6 +2435,14 @@ scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries,
 	kfree(sshdr);
 	return result;
 }
+EXPORT_SYMBOL(scsi_test_unit_ready_flags);
+
+int scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries,
+			 struct scsi_sense_hdr *sshdr_external)
+{
+	return scsi_test_unit_ready_flags(sdev, timeout, retries,
+					  sshdr_external, 0);
+}
 EXPORT_SYMBOL(scsi_test_unit_ready);
 
 /**
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..85d864b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1439,8 +1439,10 @@ static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
 
 	if (scsi_block_when_processing_errors(sdp)) {
 		sshdr  = kzalloc(sizeof(*sshdr), GFP_KERNEL);
-		retval = scsi_test_unit_ready(sdp, SD_TIMEOUT, SD_MAX_RETRIES,
-					      sshdr);
+		retval = scsi_test_unit_ready_flags(sdp, SD_TIMEOUT,
+				SD_MAX_RETRIES, sshdr, REQ_FAIL_IF_NO_PATH);
+		if (retval)
+			pr_info("%s: TUR %#x\n", __func__, retval);
 	}
 
 	/* failed to execute TUR, assume media not present */
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 8a95631..840030a 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -379,6 +379,8 @@ extern int scsi_mode_select(struct scsi_device *sdev, int pf, int sp,
 			    int timeout, int retries,
 			    struct scsi_mode_data *data,
 			    struct scsi_sense_hdr *);
+extern int scsi_test_unit_ready_flags(struct scsi_device *sdev, int timeout,
+		int retries, struct scsi_sense_hdr *sshdr, u64 flags);
 extern int scsi_test_unit_ready(struct scsi_device *sdev, int timeout,
 				int retries, struct scsi_sense_hdr *sshdr);
 extern int scsi_get_vpd_page(struct scsi_device *, u8 page, unsigned char *buf,
-- 
2.9.3


[-- Attachment #4: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-01 23:17           ` Bart Van Assche
@ 2016-09-01 23:47             ` Mike Snitzer
  2016-09-02  0:03               ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Snitzer @ 2016-09-01 23:47 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On Thu, Sep 01 2016 at  7:17pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 09/01/2016 03:27 PM, Mike Snitzer wrote:
> >On Thu, Sep 01 2016 at  6:22pm -0400,
> >Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >
> >>On 09/01/2016 03:18 PM, Mike Snitzer wrote:
> >>>FYI I get the same 'dmsetup suspend --nolockfs --noflush mp' hang,
> >>>running mptest's test_02_sdev_delete, when I try your unmodified
> >>>patchset, see:
> >>>
> >>>http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel.bart
> >>
> >>Hello Mike,
> >>
> >>Are you aware that the code on that branch is a *modified* version
> >>of my patch series? The following patch is not present on that
> >>branch: "dm path selector: Avoid that device removal triggers an
> >>infinite loop". There are also other (smaller) differences.
> >
> >No, you're obviously talking about the 'devel' branch and not the
> >'devel.bart' branch I pointed to.  The 'devel.bart' branch is the
> >_exact_ patchset you sent.  It has the same problem as the 'devel'
> >branch.
> 
> Hello Mike,
> 
> Sorry that I misread your previous e-mail. After I received your
> latest e-mail I rebased my tree on top of the devel.bart branch
> mentioned above. My tests still pass. The only two patches in my
> tree that are relevant and that are not in the devel.bart branch
> have been attached to this e-mail. Did your test involve the sd
> driver? If so, do the attached two patches help? If the sd driver
> was not involved, can you provide more information about the hang
> you ran into? The output and log messages generated by the following
> commands after the hang has been reproduced would be very welcome:
> * echo w > /proc/sysrq-trigger
> * (cd /sys/block && grep -a '' dm*/mq/*/{pending,cpu*/rq_list})

sd is used.  I'll apply those patches and test, tomorrow, but I'm pretty
skeptical.  Haven't had any problems with these tests for quite a while.

The tests I'm running are just those in the mptest testsuite, see:
https://github.com/snitm/mptest

Running them should be as simple as you doing:

git clone git://github.com/snitm/mptest.git
cd mptest
./runtest

The default is to use dm-mq on scsi-mq ontop of tcmloop.

multipath -ll shows:

mp () dm-4 LIO-ORG ,rd
size=1.0G features='4 queue_if_no_path retain_attached_hw_handler queue_mode mq' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=-1 status=active
| |- 7:0:1:0  sdj 8:144 active ready running
| `- 8:0:1:0  sdk 8:160 active ready running
`-+- policy='queue-length 0' prio=-1 status=enabled
  |- 9:0:1:0  sdl 8:176 active ready running
  `- 10:0:1:0 sdm 8:192 active ready running

[ 4839.452237] scsi host7: TCM_Loopback
[ 4839.472788] scsi host8: TCM_Loopback
[ 4839.492867] scsi host9: TCM_Loopback
[ 4839.512841] scsi host10: TCM_Loopback
[ 4839.549430] scsi 7:0:1:0: Direct-Access LIO-ORG rd 4.0 PQ: 0 ANSI: 5
[ 4839.570556] scsi 7:0:1:0: alua: supports implicit and explicit TPGS
[ 4839.577562] scsi 7:0:1:0: alua: device naa.600140559050dd34f6e46deb7e0e9f24 port group 0 rel port 1
[ 4839.587810] sd 7:0:1:0: [sdj] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 4839.587830] sd 7:0:1:0: Attached scsi generic sg10 type 0
[ 4839.593569] sd 7:0:1:0: alua: transition timeout set to 60 seconds
[ 4839.593572] sd 7:0:1:0: alua: port group 00 state A non-preferred supports TOlUSNA
[ 4839.608254] scsi 8:0:1:0: Direct-Access LIO-ORG rd 4.0 PQ: 0 ANSI: 5
[ 4839.626620] sd 7:0:1:0: [sdj] Write Protect is off
[ 4839.631974] sd 7:0:1:0: [sdj] Mode Sense: 43 00 00 08
[ 4839.631999] sd 7:0:1:0: [sdj] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 4839.642209] loopback/naa.50014056fcae4fb4: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 4839.652646] sd 7:0:1:0: [sdj] Attached SCSI disk
[ 4839.673568] scsi 8:0:1:0: alua: supports implicit and explicit TPGS
[ 4839.680573] scsi 8:0:1:0: alua: device naa.600140559050dd34f6e46deb7e0e9f24 port group 0 rel port 2
[ 4839.690814] sd 8:0:1:0: [sdk] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 4839.690888] sd 8:0:1:0: Attached scsi generic sg11 type 0
[ 4839.696543] sd 8:0:1:0: alua: port group 00 state A non-preferred supports TOlUSNA
[ 4839.711419] scsi 9:0:1:0: Direct-Access LIO-ORG rd 4.0 PQ: 0 ANSI: 5
[ 4839.722730] sd 8:0:1:0: [sdk] Write Protect is off
[ 4839.728076] sd 8:0:1:0: [sdk] Mode Sense: 43 00 00 08
[ 4839.728094] sd 8:0:1:0: [sdk] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 4839.738298] loopback/naa.500140553365fbe6: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 4839.748700] sd 8:0:1:0: [sdk] Attached SCSI disk
[ 4839.771561] scsi 9:0:1:0: alua: supports implicit and explicit TPGS
[ 4839.778567] scsi 9:0:1:0: alua: device naa.600140559050dd34f6e46deb7e0e9f24 port group 0 rel port 3
[ 4839.788794] sd 9:0:1:0: [sdl] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 4839.788823] sd 9:0:1:0: Attached scsi generic sg12 type 0
[ 4839.794546] sd 9:0:1:0: alua: port group 00 state A non-preferred supports TOlUSNA
[ 4839.809308] scsi 10:0:1:0: Direct-Access LIO-ORG rd 4.0 PQ: 0 ANSI: 5
[ 4839.820806] sd 9:0:1:0: [sdl] Write Protect is off
[ 4839.826161] sd 9:0:1:0: [sdl] Mode Sense: 43 00 00 08
[ 4839.826181] sd 9:0:1:0: [sdl] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 4839.836379] loopback/naa.5001405631dca816: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 4839.846762] sd 9:0:1:0: [sdl] Attached SCSI disk
[ 4839.856572] scsi 10:0:1:0: alua: supports implicit and explicit TPGS
[ 4839.863673] scsi 10:0:1:0: alua: device naa.600140559050dd34f6e46deb7e0e9f24 port group 0 rel port 4
[ 4839.874002] sd 10:0:1:0: [sdm] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 4839.874033] sd 10:0:1:0: Attached scsi generic sg13 type 0
[ 4839.879549] sd 10:0:1:0: alua: port group 00 state A non-preferred supports TOlUSNA
[ 4839.897162] sd 10:0:1:0: [sdm] Write Protect is off
[ 4839.902613] sd 10:0:1:0: [sdm] Mode Sense: 43 00 00 08
[ 4839.902632] sd 10:0:1:0: [sdm] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 4839.912935] loopback/naa.5001405afca06b48: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 4839.923291] sd 10:0:1:0: [sdm] Attached SCSI disk
[ 4841.065972] device-mapper: multipath queue-length: version 0.2.0 loaded

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-01 23:47             ` Mike Snitzer
@ 2016-09-02  0:03               ` Bart Van Assche
  2016-09-02 15:12                 ` Mike Snitzer
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2016-09-02  0:03 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On 09/01/2016 04:48 PM, Mike Snitzer wrote:
> On Thu, Sep 01 2016 at  7:17pm -0400,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>> Sorry that I misread your previous e-mail. After I received your
>> latest e-mail I rebased my tree on top of the devel.bart branch
>> mentioned above. My tests still pass. The only two patches in my
>> tree that are relevant and that are not in the devel.bart branch
>> have been attached to this e-mail. Did your test involve the sd
>> driver? If so, do the attached two patches help? If the sd driver
>> was not involved, can you provide more information about the hang
>> you ran into? The output and log messages generated by the following
>> commands after the hang has been reproduced would be very welcome:
>> * echo w > /proc/sysrq-trigger
>> * (cd /sys/block && grep -a '' dm*/mq/*/{pending,cpu*/rq_list})
>
> sd is used.  I'll apply those patches and test, tomorrow, but I'm pretty
> skeptical.
>
> Haven't had any problems with these tests for quite a while.  The tests
> I'm running are just those in the mptest testsuite, see:
> https://github.com/snitm/mptest
>
> Running them should be as simple as you doing:
>
> git clone git://github.com/snitm/mptest.git
> cd mptest
> ./runtest
>
> The default is to use dm-mq on scsi-mq ontop of tcmloop.
>
> [ ... ]

Hello Mike,

If I run mptest on my setup I can reproduce the hang. But what I see is
that the service-time path selector is in use when the hang is triggered.
I will patch that path selector in the same way as I did with the
queue-length path selector and rerun the test.

# dmsetup table
1Linux_scsi_debug_2000: 0 2097152 multipath 3 retain_attached_hw_handler queue_mode mq 1 alua 3 1 service-time 0 1 2 8:128 1 1 service-time 0 1 2 8:144 1 1 service-time 0 1 2 8:112 1 1
mp: 0 2097152 multipath 3 retain_attached_hw_handler queue_mode mq 1 alua 2 1 queue-length 0 2 1 8:96 1 8:112 1 queue-length 0 2 1 8:128 1 8:144 1
# (cd /sys/block && grep -a '' dm*/mq/*/{pending,cpu*/rq_list}) | grep -v ':$'
dm-0/mq/0/pending: ffff880358610000
dm-1/mq/0/pending: ffff880358220200
dm-1/mq/0/pending: ffff880358220400

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues
  2016-09-02  0:03               ` Bart Van Assche
@ 2016-09-02 15:12                 ` Mike Snitzer
  2016-09-02 16:10                   ` should blk-mq halt requeue processing while queue is frozen? [was: Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues] Mike Snitzer
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Snitzer @ 2016-09-02 15:12 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe@kernel.dk, device-mapper development, hch@lst.de

On Thu, Sep 01 2016 at  8:03pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 09/01/2016 04:48 PM, Mike Snitzer wrote:
> > On Thu, Sep 01 2016 at  7:17pm -0400,
> > Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >> Sorry that I misread your previous e-mail. After I received your
> >> latest e-mail I rebased my tree on top of the devel.bart branch
> >> mentioned above. My tests still pass. The only two patches in my
> >> tree that are relevant and that are not in the devel.bart branch
> >> have been attached to this e-mail. Did your test involve the sd
> >> driver? If so, do the attached two patches help? If the sd driver
> >> was not involved, can you provide more information about the hang
> >> you ran into? The output and log messages generated by the following
> >> commands after the hang has been reproduced would be very welcome:
> >> * echo w > /proc/sysrq-trigger
> >> * (cd /sys/block && grep -a '' dm*/mq/*/{pending,cpu*/rq_list})
> >
> > sd is used.  I'll apply those patches and test, tomorrow, but I'm pretty
> > skeptical.
> >
> > Haven't had any problems with these tests for quite a while.  The tests
> > I'm running are just those in the mptest testsuite, see:
> > https://github.com/snitm/mptest
> >
> > Running them should be as simple as you doing:
> >
> > git clone git://github.com/snitm/mptest.git
> > cd mptest
> > ./runtest
> >
> > The default is to use dm-mq on scsi-mq ontop of tcmloop.
> >
> > [ ... ]
> 
> Hello Mike,
> 
> If I run mptest on my setup I can reproduce the hang. But what I see is
> that the service-time path selector is in use when the hang is triggered.
> I will patch that path selector in the same way as I did with the
> queue-length path selector and rerun the test.
> 
> # dmsetup table
> 1Linux_scsi_debug_2000: 0 2097152 multipath 3 retain_attached_hw_handler queue_mode mq 1 alua 3 1 service-time 0 1 2 8:128 1 1 service-time 0 1 2 8:144 1 1 service-time 0 1 2 8:112 1 1
> mp: 0 2097152 multipath 3 retain_attached_hw_handler queue_mode mq 1 alua 2 1 queue-length 0 2 1 8:96 1 8:112 1 queue-length 0 2 1 8:128 1 8:144 1
> # (cd /sys/block && grep -a '' dm*/mq/*/{pending,cpu*/rq_list}) | grep -v ':$'
> dm-0/mq/0/pending: ffff880358610000
> dm-1/mq/0/pending: ffff880358220200
> dm-1/mq/0/pending: ffff880358220400

In case I haven't been clear: calling blk_mq_freeze_queue() _after_ you've
suspended the DM device will only trigger IO that will get re-queued to
the DM device's blk-mq queue.  So you're creating a livelock since
blk_mq_freeze_queue_wait() will not return, see stack from:
https://www.redhat.com/archives/dm-devel/2016-September/msg00017.html

FYI, the BLK_MQ_S_STOPPED check that you removed from dm_mq_queue_rq() in
this commit...
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel.bart&id=69eb3e60e099a6117fc754e70eedd504685326ad
...is effectively serving as this when the device is suspended:

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index b5db523..3bc16dc 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -863,6 +863,9 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		dm_put_live_table(md, srcu_idx);
 	}
 
+	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)))
+		return BLK_MQ_RQ_QUEUE_BUSY;
+
 	if (ti->type->busy && ti->type->busy(ti))
 		return BLK_MQ_RQ_QUEUE_BUSY;

Which is comparable to what the old .request_fn DM does (see:
dm_make_request).

Without that type of suspend check the code will go on to call
dm-mpath.c:multipath_clone_and_map() which will only result in
DM_MAPIO_REQUEUE (and dm_mq_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY in
the case when multipath has no paths, via must_push_back_rq()).

Anyway, not what we want in general.  The goal during DM device suspend
is to stop IO from being mapped.  With request-based DM suspend we're
punting requests back to the block layer's request_queue.

So in the case of blk-mq request-based DM: we cannot expect
blk_mq_freeze_queue(), during suspend, to complete if requests are
getting requeued to the blk-mq queue via BLK_MQ_RQ_QUEUE_BUSY.

Mike

^ permalink raw reply related	[flat|nested] 45+ messages in thread
* should blk-mq halt requeue processing while queue is frozen? [was: Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues]
  2016-09-02 15:12                 ` Mike Snitzer
@ 2016-09-02 16:10                   ` Mike Snitzer
  2016-09-02 22:42                     ` [dm-devel] should blk-mq halt requeue processing while queue is frozen? Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Snitzer @ 2016-09-02 16:10 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe, device-mapper development, hch, linux-block

On Fri, Sep 02 2016 at 11:12am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> So in the case of blk-mq request-based DM: we cannot expect
> blk_mq_freeze_queue(), during suspend, to complete if requests are
> getting requeued to the blk-mq queue via BLK_MQ_RQ_QUEUE_BUSY.

Looking closer at blk-mq.  Currently __blk_mq_run_hw_queue() will move
any requeued requests to the hctx->dispatch list and then performs async
blk_mq_run_hw_queue().

To do what you hoped (have blk_mq_freeze_queue() discontinue all use of
blk-mq hw queues during DM suspend) I think we'd need blk-mq to:
1) avoid processing requeued IO if blk_mq_freeze_queue() was used to
   freeze the queue.  Meaning it'd have to hold requeued work longer
   than it currently does.
2) Then once blk_mq_unfreeze_queue() is called it'd allow requeues to
   proceed.

This would be catering to a very specific requirement of DM (given it
re-queues IO back to the request_queue during suspend).

BUT all said, relative to request-based DM multipath, what we have is
perfectly fine on a correctness level: the requests are re-queued
because the blk-mq DM device is suspended.

Unfortunately on an efficiency level DM suspend creates a lot of busy
looping in blk-mq, with 100% cpu usage in threads with names
"kworker/3:1H", ideally we'd avoid that!

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [dm-devel] should blk-mq halt requeue processing while queue is frozen?
  2016-09-02 16:10                   ` should blk-mq halt requeue processing while queue is frozen? [was: Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues] Mike Snitzer
@ 2016-09-02 22:42                     ` Bart Van Assche
  2016-09-03  0:34                       ` Mike Snitzer
  2016-09-07 16:41                       ` Mike Snitzer
  0 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2016-09-02 22:42 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe, linux-block, device-mapper development, hch

[-- Attachment #1: Type: text/plain, Size: 1852 bytes --]

On 09/02/2016 09:10 AM, Mike Snitzer wrote:
> On Fri, Sep 02 2016 at 11:12am -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
>> So in the case of blk-mq request-based DM: we cannot expect
>> blk_mq_freeze_queue(), during suspend, to complete if requests are
>> getting requeued to the blk-mq queue via BLK_MQ_RQ_QUEUE_BUSY.
>
> Looking closer at blk-mq.  Currently __blk_mq_run_hw_queue() will move
> any requeued requests to the hctx->dispatch list and then performs async
> blk_mq_run_hw_queue().
>
> To do what you hoped (have blk_mq_freeze_queue() discontinue all use of
> blk-mq hw queues during DM suspend) I think we'd need blk-mq to:
> 1) avoid processing requeued IO if blk_mq_freeze_queue() was used to
>    freeze the queue.  Meaning it'd have to hold requeued work longer
>    than it currently does.
> 2) Then once blk_mq_unfreeze_queue() is called it'd allow requeues to
>    proceed.
>
> This would be catering to a very specific requirement of DM (given it
> re-queues IO back to the request_queue during suspend).
>
> BUT all said, relative to request-based DM multipath, what we have is
> perfectly fine on a correctness level: the requests are re-queued
> because the blk-mq DM device is suspended.
>
> Unfortunately on an efficiency level DM suspend creates a lot of busy
> looping in blk-mq, with 100% cpu usage in threads with names
> "kworker/3:1H", ideally we'd avoid that!

Hello Mike,

What blk_mq_freeze_queue() does is to wait until queue_rq() has finished
*and* all pending requests have completed. However, I think in
dm_stop_queue() all we need is to wait until queue_rq() has finished.
How about adding new functions in the block layer core to realize this,
e.g. something like in the attached (untested) patch? Busy looping
should be avoided - see also the tests of the new "quiescing" flag.

Thanks,

Bart.

[-- Attachment #2: 0001-blk-mq-Introduce-blk_mq_quiesce_queue.patch --]
[-- Type: text/x-patch, Size: 4321 bytes --]

From e55a161ee4df7804767ed8faf9ddb698e8852b06 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanassche@sandisk.com>
Date: Fri, 2 Sep 2016 09:32:17 -0700
Subject: [PATCH] blk-mq: Introduce blk_mq_quiesce_queue()

---
 block/blk-mq.c         | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  3 +++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 123d1ad..0320cd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -135,6 +135,46 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
+/**
+ * blk_mq_quiesce_queue - wait until all pending queue_rq calls have finished
+ *
+ * Prevent that new I/O requests are queued and wait until all pending
+ * queue_rq() calls have finished.
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	WARN_ON_ONCE(blk_queue_quiescing(q));
+	queue_flag_set(QUEUE_FLAG_QUIESCING, q);
+	spin_unlock_irq(q->queue_lock);
+
+	atomic_inc_return(&q->mq_freeze_depth);
+	blk_mq_run_hw_queues(q, false);
+	synchronize_rcu();
+
+	spin_lock_irq(q->queue_lock);
+	WARN_ON_ONCE(!blk_queue_quiescing(q));
+	queue_flag_clear(QUEUE_FLAG_QUIESCING, q);
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
+
+/**
+ * blk_mq_resume_queue - resume request processing
+ */
+void blk_mq_resume_queue(struct request_queue *q)
+{
+	int freeze_depth;
+
+	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
+	WARN_ON_ONCE(freeze_depth < 0);
+	if (freeze_depth == 0)
+		wake_up_all(&q->mq_freeze_wq);
+
+	blk_mq_run_hw_queues(q, false);
+}
+EXPORT_SYMBOL_GPL(blk_mq_resume_queue);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -506,6 +546,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
 	struct request *rq, *next;
 	unsigned long flags;
 
+	if (blk_queue_quiescing(q))
+		return;
+
 	spin_lock_irqsave(&q->requeue_lock, flags);
 	list_splice_init(&q->requeue_list, &rq_list);
 	spin_unlock_irqrestore(&q->requeue_lock, flags);
@@ -806,6 +849,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	 */
 	flush_busy_ctxs(hctx, &rq_list);
 
+	rcu_read_lock();
+
 	/*
 	 * If we have previous entries on our dispatch list, grab them
 	 * and stuff them at the front for more fair dispatch.
@@ -888,8 +933,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 *
 		 * blk_mq_run_hw_queue() already checks the STOPPED bit
 		 **/
-		blk_mq_run_hw_queue(hctx, true);
+		if (!blk_queue_quiescing(q))
+			blk_mq_run_hw_queue(hctx, true);
 	}
+
+	rcu_read_unlock();
 }
 
 /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index bd677bc..8fc07bb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -248,6 +248,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_mq_freeze_queue_start(struct request_queue *q);
+void blk_mq_quiesce_queue(struct request_queue *q);
+void blk_mq_resume_queue(struct request_queue *q);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e79055c..9b360fc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -505,6 +505,7 @@ struct request_queue {
 #define QUEUE_FLAG_FUA	       24	/* device supports FUA writes */
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
+#define QUEUE_FLAG_QUIESCING   27
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -595,6 +596,8 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_secure_erase(q) \
 	(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
+#define blk_queue_quiescing(q)	test_bit(QUEUE_FLAG_QUIESCING,	\
+					 &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 45+ messages in thread
* Re: should blk-mq halt requeue processing while queue is frozen?
  2016-09-02 22:42                     ` [dm-devel] should blk-mq halt requeue processing while queue is frozen? Bart Van Assche
@ 2016-09-03  0:34                       ` Mike Snitzer
  2016-09-07 16:41                       ` Mike Snitzer
  0 siblings, 0 replies; 45+ messages in thread
From: Mike Snitzer @ 2016-09-03  0:34 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe, linux-block, device-mapper development, hch

On Fri, Sep 02 2016 at  6:42pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 09/02/2016 09:10 AM, Mike Snitzer wrote:
> >On Fri, Sep 02 2016 at 11:12am -0400,
> >Mike Snitzer <snitzer@redhat.com> wrote:
> >
> >>So in the case of blk-mq request-based DM: we cannot expect
> >>blk_mq_freeze_queue(), during suspend, to complete if requests are
> >>getting requeued to the blk-mq queue via BLK_MQ_RQ_QUEUE_BUSY.
> >
> >Looking closer at blk-mq.  Currently __blk_mq_run_hw_queue() will move
> >any requeued requests to the hctx->dispatch list and then performs async
> >blk_mq_run_hw_queue().
> >
> >To do what you hoped (have blk_mq_freeze_queue() discontinue all use of
> >blk-mq hw queues during DM suspend) I think we'd need blk-mq to:
> >1) avoid processing requeued IO if blk_mq_freeze_queue() was used to
> >   freeze the queue.  Meaning it'd have to hold requeued work longer
> >   than it currently does.
> >2) Then once blk_mq_unfreeze_queue() is called it'd allow requeues to
> >   proceed.
> >
> >This would be catering to a very specific requirement of DM (given it
> >re-queues IO back to the request_queue during suspend).
> >
> >BUT all said, relative to request-based DM multipath, what we have is
> >perfectly fine on a correctness level: the requests are re-queued
> >because the blk-mq DM device is suspended.
> >
> >Unfortunately on an efficiency level DM suspend creates a lot of busy
> >looping in blk-mq, with 100% cpu usage in threads with names
> >"kworker/3:1H", ideally we'd avoid that!
> 
> Hello Mike,
> 
> What blk_mq_freeze_queue() does is to wait until queue_rq() has
> finished *and* all pending requests have completed.

Right, I had a look at blk-mq this afternoon and it is clear that the
current implementation of blk-mq's freeze (in terms of percpu
q->q_usage_counter dropping to zero) won't fly for the DM requeue
usecase.

> However, I think
> in dm_stop_queue() all we need is to wait until queue_rq() has
> finished. How about adding new functions in the block layer core to
> realize this, e.g. something like in the attached (untested) patch?
> Busy looping should be avoided - see also the tests of the new
> "quiescing" flag.

I'll take a closer look at your patch next week.  The reuse of the
mq_freeze_depth to achieve this quiesce/resume will need closer review
-- likely by Jens.

>  void blk_mq_wake_waiters(struct request_queue *q)
>  {
>  	struct blk_mq_hw_ctx *hctx;
> @@ -506,6 +546,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
>  	struct request *rq, *next;
>  	unsigned long flags;
>  
> +	if (blk_queue_quiescing(q))
> +		return;
> +
>  	spin_lock_irqsave(&q->requeue_lock, flags);
>  	list_splice_init(&q->requeue_list, &rq_list);
>  	spin_unlock_irqrestore(&q->requeue_lock, flags);
> @@ -806,6 +849,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	flush_busy_ctxs(hctx, &rq_list);
>  
> +	rcu_read_lock();
> +
>  	/*
>  	 * If we have previous entries on our dispatch list, grab them
>  	 * and stuff them at the front for more fair dispatch.
> @@ -888,8 +933,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	 *
>  	 * blk_mq_run_hw_queue() already checks the STOPPED bit
>  	 **/
> -	blk_mq_run_hw_queue(hctx, true);
> +	if (!blk_queue_quiescing(q))
> +		blk_mq_run_hw_queue(hctx, true);
>  	}
> +
> +	rcu_read_unlock();
>  }

Yes, those are the correct hooks to place code to conditionally
short-circuit normal blk-mq operation.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: should blk-mq halt requeue processing while queue is frozen? 2016-09-02 22:42 ` [dm-devel] should blk-mq halt requeue processing while queue is frozen? Bart Van Assche 2016-09-03 0:34 ` Mike Snitzer @ 2016-09-07 16:41 ` Mike Snitzer 2016-09-13 8:01 ` [dm-devel] " Bart Van Assche 1 sibling, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-07 16:41 UTC (permalink / raw) To: Bart Van Assche; +Cc: axboe, linux-block, device-mapper development, hch On Fri, Sep 02 2016 at 6:42pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > However, I think > in dm_stop_queue() all we need is to wait until queue_rq() has > finished. How about adding new functions in the block layer core to > realize this, e.g. something like in the attached (untested) patch? > Busy looping should be avoided - see also the tests of the new > "quiescing" flag. > > Thanks, > > Bart. Comments inlined below. > From e55a161ee4df7804767ed8faf9ddb698e8852b06 Mon Sep 17 00:00:00 2001 > From: Bart Van Assche <bart.vanassche@sandisk.com> > Date: Fri, 2 Sep 2016 09:32:17 -0700 > Subject: [PATCH] blk-mq: Introduce blk_mq_quiesce_queue() > > --- > block/blk-mq.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++- > include/linux/blk-mq.h | 2 ++ > include/linux/blkdev.h | 3 +++ > 3 files changed, 54 insertions(+), 1 deletion(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 123d1ad..0320cd9 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -135,6 +135,46 @@ void blk_mq_unfreeze_queue(struct request_queue *q) > } > EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue); > > +/** > + * blk_mq_quiesce_queue - wait until all pending queue_rq calls have finished > + * > + * Prevent that new I/O requests are queued and wait until all pending > + * queue_rq() calls have finished. 
> + */ > +void blk_mq_quiesce_queue(struct request_queue *q) > +{ > + spin_lock_irq(q->queue_lock); > + WARN_ON_ONCE(blk_queue_quiescing(q)); > + queue_flag_set(QUEUE_FLAG_QUIESCING, q); > + spin_unlock_irq(q->queue_lock); > + > + atomic_inc_return(&q->mq_freeze_depth); > + blk_mq_run_hw_queues(q, false); > + synchronize_rcu(); Why the synchronize_rcu()? Also, you're effectively open-coding blk_mq_freeze_queue_start() minus the q->q_usage_counter mgmt. Why not add a flag to conditionally manage q->q_usage_counter to blk_mq_freeze_queue_start()? > + spin_lock_irq(q->queue_lock); > + WARN_ON_ONCE(!blk_queue_quiescing(q)); > + queue_flag_clear(QUEUE_FLAG_QUIESCING, q); > + spin_unlock_irq(q->queue_lock); > +} > +EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue); > + > +/** > + * blk_mq_resume_queue - resume request processing > + */ > +void blk_mq_resume_queue(struct request_queue *q) > +{ > + int freeze_depth; > + > + freeze_depth = atomic_dec_return(&q->mq_freeze_depth); > + WARN_ON_ONCE(freeze_depth < 0); > + if (freeze_depth == 0) > + wake_up_all(&q->mq_freeze_wq); Likewise, here you've open-coded blk_mq_unfreeze_queue(). Adding a flag to conditionally reinit q->q_usage_counter would be better. But I'm concerned about blk_mq_{quiesce,resume}_queue vs blk_mq_{freeze,unfreeze}_queue -- e.g. if "freeze" is nested after "quiesce" (but before "resume") it would still need the q->q_usage_counter management. Your patch as-is would break the blk-mq freeze interface. 
> + blk_mq_run_hw_queues(q, false); > +} > +EXPORT_SYMBOL_GPL(blk_mq_resume_queue); > + > void blk_mq_wake_waiters(struct request_queue *q) > { > struct blk_mq_hw_ctx *hctx; > @@ -506,6 +546,9 @@ static void blk_mq_requeue_work(struct work_struct *work) > struct request *rq, *next; > unsigned long flags; > > + if (blk_queue_quiescing(q)) > + return; > + > spin_lock_irqsave(&q->requeue_lock, flags); > list_splice_init(&q->requeue_list, &rq_list); > spin_unlock_irqrestore(&q->requeue_lock, flags); > @@ -806,6 +849,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) > */ > flush_busy_ctxs(hctx, &rq_list); > > + rcu_read_lock(); > + > /* > * If we have previous entries on our dispatch list, grab them > * and stuff them at the front for more fair dispatch. > @@ -888,8 +933,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) > * > * blk_mq_run_hw_queue() already checks the STOPPED bit > **/ > - blk_mq_run_hw_queue(hctx, true); > + if (!blk_queue_quiescing(q)) > + blk_mq_run_hw_queue(hctx, true); > } > + > + rcu_read_unlock(); > } > > /* Please explain this extra rcu_read_{lock,unlock} ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dm-devel] should blk-mq halt requeue processing while queue is frozen? 2016-09-07 16:41 ` Mike Snitzer @ 2016-09-13 8:01 ` Bart Van Assche 2016-09-13 14:36 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-13 8:01 UTC (permalink / raw) To: Mike Snitzer, Bart Van Assche Cc: axboe, linux-block, device-mapper development, hch On 09/07/2016 06:41 PM, Mike Snitzer wrote: > On Fri, Sep 02 2016 at 6:42pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: >> +/** >> + * blk_mq_quiesce_queue - wait until all pending queue_rq calls have finished >> + * >> + * Prevent that new I/O requests are queued and wait until all pending >> + * queue_rq() calls have finished. >> + */ >> +void blk_mq_quiesce_queue(struct request_queue *q) >> +{ >> + spin_lock_irq(q->queue_lock); >> + WARN_ON_ONCE(blk_queue_quiescing(q)); >> + queue_flag_set(QUEUE_FLAG_QUIESCING, q); >> + spin_unlock_irq(q->queue_lock); >> + >> + atomic_inc_return(&q->mq_freeze_depth); >> + blk_mq_run_hw_queues(q, false); >> + synchronize_rcu(); > > Why the synchronize_rcu()? Hello Mike, Adding read_lock() + read_unlock() in __blk_mq_run_hw_queue() and synchronize_rcu() in blk_mq_quiesce_queue() is the lowest overhead mechanism I know of to make the latter function wait until the former has finished. > Also, you're effectively open-coding blk_mq_freeze_queue_start() minus > the q->q_usage_counter mgmt. Why not add a flag to conditionally manage > q->q_usage_counter to blk_mq_freeze_queue_start()? I will consider this. > But I'm concerned about blk_mq_{quiesce,resume}_queue vs > blk_mq_{freeze,unfreeze}_queue -- e.g. if "freeze" is nested after > "queue" (but before "resume") it would still need the q->q_usage_counter > management. Your patch as-is would break the blk-mq freeze interface. Agreed. blk_mq_{quiesce,resume}_queue() has to manipulate q_usage_counter in the same way as blk_mq_{freeze,unfreeze}_queue(). 
Once I am back in the office I will rework this patch and send it to Jens. Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: should blk-mq halt requeue processing while queue is frozen? 2016-09-13 8:01 ` [dm-devel] " Bart Van Assche @ 2016-09-13 14:36 ` Mike Snitzer 0 siblings, 0 replies; 45+ messages in thread From: Mike Snitzer @ 2016-09-13 14:36 UTC (permalink / raw) To: Bart Van Assche Cc: Bart Van Assche, axboe, linux-block, device-mapper development, hch On Tue, Sep 13 2016 at 4:01am -0400, Bart Van Assche <bart.vanassche@gmail.com> wrote: > On 09/07/2016 06:41 PM, Mike Snitzer wrote: > >On Fri, Sep 02 2016 at 6:42pm -0400, > >Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >>+/** > >>+ * blk_mq_quiesce_queue - wait until all pending queue_rq calls have finished > >>+ * > >>+ * Prevent that new I/O requests are queued and wait until all pending > >>+ * queue_rq() calls have finished. > >>+ */ > >>+void blk_mq_quiesce_queue(struct request_queue *q) > >>+{ > >>+ spin_lock_irq(q->queue_lock); > >>+ WARN_ON_ONCE(blk_queue_quiescing(q)); > >>+ queue_flag_set(QUEUE_FLAG_QUIESCING, q); > >>+ spin_unlock_irq(q->queue_lock); > >>+ > >>+ atomic_inc_return(&q->mq_freeze_depth); > >>+ blk_mq_run_hw_queues(q, false); > >>+ synchronize_rcu(); > > > >Why the synchronize_rcu()? > > Hello Mike, > > Adding read_lock() + read_unlock() in __blk_mq_run_hw_queue() and > synchronize_rcu() in blk_mq_quiesce_queue() is the lowest overhead > mechanism I know of to make the latter function wait until the > former has finished. OK. > >Also, you're effectively open-coding blk_mq_freeze_queue_start() minus > >the q->q_usage_counter mgmt. Why not add a flag to conditionally manage > >q->q_usage_counter to blk_mq_freeze_queue_start()? > > I will consider this. > > >But I'm concerned about blk_mq_{quiesce,resume}_queue vs > >blk_mq_{freeze,unfreeze}_queue -- e.g. if "freeze" is nested after > >"queue" (but before "resume") it would still need the q->q_usage_counter > >management. Your patch as-is would break the blk-mq freeze interface. > > Agreed. 
blk_mq_{quiesce,resume}_queue() has to manipulate > q_usage_counter in the same way as blk_mq_{freeze,unfreeze}_queue(). > Once I am back in the office I will rework this patch and send it to > Jens. Please base any further work in this area on top of http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel And please verify all the mptest tests pass with your changes in place. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop 2016-08-31 22:14 [PATCH 0/9] dm patches for kernel v4.9 Bart Van Assche ` (7 preceding siblings ...) 2016-08-31 22:18 ` [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues Bart Van Assche @ 2016-08-31 22:18 ` Bart Van Assche 2016-09-01 2:49 ` Mike Snitzer 8 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-08-31 22:18 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development If pg_init_retries is set and a request is queued against a multipath device with all underlying block devices in the "dying" state then an infinite loop is triggered because activate_path() never succeeds and hence never calls pg_init_done(). Fix this by making ql_select_path() skip dying paths. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> --- drivers/md/dm-queue-length.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c index 23f1786..a283c66 100644 --- a/drivers/md/dm-queue-length.c +++ b/drivers/md/dm-queue-length.c @@ -199,11 +199,12 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes) list_move_tail(s->valid_paths.next, &s->valid_paths); list_for_each_entry(pi, &s->valid_paths, list) { - if (!best || - (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) + if ((!best || + atomic_read(&pi->qlen) < atomic_read(&best->qlen)) && + !blk_queue_dying(pi->path->dev->bdev->bd_queue)) best = pi; - if (!atomic_read(&best->qlen)) + if (best && atomic_read(&best->qlen) == 0) break; } -- 2.9.3 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop 2016-08-31 22:18 ` [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop Bart Van Assche @ 2016-09-01 2:49 ` Mike Snitzer 2016-09-01 14:14 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 2:49 UTC (permalink / raw) To: Bart Van Assche; +Cc: device-mapper development On Wed, Aug 31 2016 at 6:18pm -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > If pg_init_retries is set and a request is queued against a > multipath device with all underlying block devices in the "dying" > state then an infinite loop is triggered because activate_path() > never succeeds and hence never calls pg_init_done(). Fix this by > making ql_select_path() skip dying paths. > > Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Assuming DM multipath needs to be sprinkling these dying queue checks so deep (which I'm not yet sold on): Same would be needed in service-time and round-robin right? > --- > drivers/md/dm-queue-length.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c > index 23f1786..a283c66 100644 > --- a/drivers/md/dm-queue-length.c > +++ b/drivers/md/dm-queue-length.c > @@ -199,11 +199,12 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes) > list_move_tail(s->valid_paths.next, &s->valid_paths); > > list_for_each_entry(pi, &s->valid_paths, list) { > - if (!best || > - (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) > + if ((!best || > + atomic_read(&pi->qlen) < atomic_read(&best->qlen)) && > + !blk_queue_dying(pi->path->dev->bdev->bd_queue)) > best = pi; > > - if (!atomic_read(&best->qlen)) > + if (best && atomic_read(&best->qlen) == 0) > break; > } > > -- > 2.9.3 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop 2016-09-01 2:49 ` Mike Snitzer @ 2016-09-01 14:14 ` Bart Van Assche 2016-09-01 15:06 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 14:14 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development On 08/31/16 20:29, Mike Snitzer wrote: > On Wed, Aug 31 2016 at 6:18pm -0400, > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >> If pg_init_retries is set and a request is queued against a >> multipath device with all underlying block devices in the "dying" >> state then an infinite loop is triggered because activate_path() >> never succeeds and hence never calls pg_init_done(). Fix this by >> making ql_select_path() skip dying paths. > > Assuming DM multipath needs to be sprinkling these dying queue checks so > deep (which I'm not yet sold on): > > Same would be needed in service-time and round-robin right? Hello Mike, Before addressing service-time and round-robin path selectors I wanted to make sure that we reach agreement about how to fix the queue length path selector. Do you have a proposal for an alternative approach to fix the infinite loop that can be triggered during device removal? Thanks, Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop 2016-09-01 14:14 ` Bart Van Assche @ 2016-09-01 15:06 ` Mike Snitzer 2016-09-01 15:22 ` Bart Van Assche 0 siblings, 1 reply; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 15:06 UTC (permalink / raw) To: Bart Van Assche; +Cc: device-mapper development On Thu, Sep 01 2016 at 10:14am -0400, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote: > On 08/31/16 20:29, Mike Snitzer wrote: > > On Wed, Aug 31 2016 at 6:18pm -0400, > > Bart Van Assche <bart.vanassche@sandisk.com> wrote: > > > >> If pg_init_retries is set and a request is queued against a > >> multipath device with all underlying block devices in the "dying" > >> state then an infinite loop is triggered because activate_path() > >> never succeeds and hence never calls pg_init_done(). Fix this by > >> making ql_select_path() skip dying paths. > > > > Assuming DM multipath needs to be sprinkling these dying queue checks so > > deep (which I'm not yet sold on): > > > > Same would be needed in service-time and round-robin right? > > Hello Mike, > > Before addressing service-time and round-robin path selectors I wanted > to make sure that we reach agreement about how to fix the queue length > path selector. > > Do you have a proposal for an alternative approach to fix the infinite > loop that can be triggered during device removal? I'm going to look closer now. But I'd prefer to see the "dying" state check(s) elevated to DM multipath. Really would rather the path selectors not have to worry about this state. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop 2016-09-01 15:06 ` Mike Snitzer @ 2016-09-01 15:22 ` Bart Van Assche 2016-09-01 15:26 ` Mike Snitzer 0 siblings, 1 reply; 45+ messages in thread From: Bart Van Assche @ 2016-09-01 15:22 UTC (permalink / raw) To: Mike Snitzer; +Cc: device-mapper development On 09/01/2016 08:06 AM, Mike Snitzer wrote: > On Thu, Sep 01 2016 at 10:14am -0400, > Bart Van Assche <Bart.VanAssche@sandisk.com> wrote: > >> On 08/31/16 20:29, Mike Snitzer wrote: >>> On Wed, Aug 31 2016 at 6:18pm -0400, >>> Bart Van Assche <bart.vanassche@sandisk.com> wrote: >>> >>>> If pg_init_retries is set and a request is queued against a >>>> multipath device with all underlying block devices in the "dying" >>>> state then an infinite loop is triggered because activate_path() >>>> never succeeds and hence never calls pg_init_done(). Fix this by >>>> making ql_select_path() skip dying paths. >>> >>> Assuming DM multipath needs to be sprinkling these dying queue checks so >>> deep (which I'm not yet sold on): >>> >>> Same would be needed in service-time and round-robin right? >> >> Hello Mike, >> >> Before addressing service-time and round-robin path selectors I wanted >> to make sure that we reach agreement about how to fix the queue length >> path selector. >> >> Do you have a proposal for an alternative approach to fix the infinite >> loop that can be triggered during device removal? > > I'm going to look closer now. But I'd prefer to see the "dying" state > check(s) elevated to DM multipath. Really would rather the path > selectors not have to worry about this state. Hello Mike, How about making blk_cleanup_queue() invoke a callback function in dm or dm-mpath and to use that callback function to keep track of the number of paths that are not in the "dying" state? That would allow to detect in the dm or dm-mpath driver whether or not all paths are in the dying state without having to modify every path selector. 
This is just an idea - there might be better alternatives. Bart. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop 2016-09-01 15:22 ` Bart Van Assche @ 2016-09-01 15:26 ` Mike Snitzer 0 siblings, 0 replies; 45+ messages in thread From: Mike Snitzer @ 2016-09-01 15:26 UTC (permalink / raw) To: Bart Van Assche; +Cc: device-mapper development On Thu, Sep 01 2016 at 11:22P -0400, Bart Van Assche <bart.vanassche@sandisk.com> wrote: > On 09/01/2016 08:06 AM, Mike Snitzer wrote: > >On Thu, Sep 01 2016 at 10:14am -0400, > >Bart Van Assche <Bart.VanAssche@sandisk.com> wrote: > > > >>On 08/31/16 20:29, Mike Snitzer wrote: > >>>On Wed, Aug 31 2016 at 6:18pm -0400, > >>>Bart Van Assche <bart.vanassche@sandisk.com> wrote: > >>> > >>>>If pg_init_retries is set and a request is queued against a > >>>>multipath device with all underlying block devices in the "dying" > >>>>state then an infinite loop is triggered because activate_path() > >>>>never succeeds and hence never calls pg_init_done(). Fix this by > >>>>making ql_select_path() skip dying paths. > >>> > >>>Assuming DM multipath needs to be sprinkling these dying queue checks so > >>>deep (which I'm not yet sold on): > >>> > >>>Same would be needed in service-time and round-robin right? > >> > >>Hello Mike, > >> > >>Before addressing service-time and round-robin path selectors I wanted > >>to make sure that we reach agreement about how to fix the queue length > >>path selector. > >> > >>Do you have a proposal for an alternative approach to fix the infinite > >>loop that can be triggered during device removal? > > > >I'm going to look closer now. But I'd prefer to see the "dying" state > >check(s) elevated to DM multipath. Really would rather the path > >selectors not have to worry about this state. > > Hello Mike, > > How about making blk_cleanup_queue() invoke a callback function in dm or > dm-mpath and to use that callback function to keep track of the number of > paths that are not in the "dying" state? 
That would allow to detect in the > dm or dm-mpath driver whether or not all paths are in the dying state > without having to modify every path selector. This is just an idea - there > might be better alternatives. Even that seems like overkill. What about this? Any chance you could try the linux-dm.git 'dm-4.9' branch with this patch ontop? diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index ac734e5..15db5e9 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -1521,10 +1521,10 @@ static void activate_path(struct work_struct *work) { struct pgpath *pgpath = container_of(work, struct pgpath, activate_path.work); + struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev); - if (pgpath->is_active) - scsi_dh_activate(bdev_get_queue(pgpath->path.dev->bdev), - pg_init_done, pgpath); + if (pgpath->is_active && !blk_queue_dying(q)) + scsi_dh_activate(q, pg_init_done, pgpath); else pg_init_done(pgpath, SCSI_DH_DEV_OFFLINED); } ^ permalink raw reply related [flat|nested] 45+ messages in thread
End of thread (newest: 2016-09-13 14:36 UTC)

Thread overview: 45+ messages:
2016-08-31 22:14 [PATCH 0/9] dm patches for kernel v4.9 Bart Van Assche
2016-08-31 22:15 ` [PATCH 1/9] blk-mq: Introduce blk_mq_queue_stopped() Bart Van Assche
2016-08-31 22:16 ` [PATCH 2/9] dm: Rename a function argument Bart Van Assche
2016-09-01  3:29   ` Mike Snitzer
2016-09-01 14:17     ` Bart Van Assche
2016-08-31 22:16 ` [PATCH 3/9] dm: Introduce signal_pending_state() Bart Van Assche
2016-08-31 22:16 ` [PATCH 4/9] dm: Convert wait loops Bart Van Assche
2016-08-31 22:17 ` [PATCH 5/9] dm: Add two lockdep_assert_held() statements Bart Van Assche
2016-08-31 22:17 ` [PATCH 6/9] dm: Simplify dm_old_stop_queue() Bart Van Assche
2016-08-31 22:17 ` [PATCH 7/9] dm: Mark block layer queue dead before destroying the dm device Bart Van Assche
2016-08-31 22:18 ` [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues Bart Van Assche
2016-09-01  3:13   ` Mike Snitzer
2016-09-01 14:23     ` Bart Van Assche
2016-09-01 15:05       ` Mike Snitzer
2016-09-01 15:31         ` Bart Van Assche
2016-09-01 15:50           ` Mike Snitzer
2016-09-01 16:12             ` Mike Snitzer
2016-09-01 17:59               ` Bart Van Assche
2016-09-01 19:05                 ` Mike Snitzer
2016-09-01 19:35                   ` Mike Snitzer
2016-09-01 20:15                     ` Bart Van Assche
2016-09-01 20:33                       ` Mike Snitzer
2016-09-01 20:39                         ` Bart Van Assche
2016-09-01 20:48                           ` Mike Snitzer
2016-09-01 20:52                             ` Bart Van Assche
2016-09-01 21:17                               ` Mike Snitzer
2016-09-01 22:18                                 ` Mike Snitzer
2016-09-01 22:22                                   ` Bart Van Assche
2016-09-01 22:26                                     ` Mike Snitzer
2016-09-01 23:17                                       ` Bart Van Assche
2016-09-01 23:47                                         ` Mike Snitzer
2016-09-02  0:03                                           ` Bart Van Assche
2016-09-02 15:12                                             ` Mike Snitzer
2016-09-02 16:10 ` should blk-mq halt requeue processing while queue is frozen? [was: Re: [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues] Mike Snitzer
2016-09-02 22:42   ` [dm-devel] should blk-mq halt requeue processing while queue is frozen? Bart Van Assche
2016-09-03  0:34     ` Mike Snitzer
2016-09-07 16:41     ` Mike Snitzer
2016-09-13  8:01       ` [dm-devel] " Bart Van Assche
2016-09-13 14:36         ` Mike Snitzer
2016-08-31 22:18 ` [PATCH 9/9] dm path selector: Avoid that device removal triggers an infinite loop Bart Van Assche
2016-09-01  2:49   ` Mike Snitzer
2016-09-01 14:14     ` Bart Van Assche
2016-09-01 15:06       ` Mike Snitzer
2016-09-01 15:22         ` Bart Van Assche
2016-09-01 15:26           ` Mike Snitzer