* [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
@ 2022-10-20 3:53 Chao Leng
2022-10-20 3:53 ` [PATCH v3 1/2] blk-mq: add tagset quiesce interface Chao Leng
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Chao Leng @ 2022-10-20 3:53 UTC (permalink / raw)
To: linux-nvme, linux-block
Cc: hch, sagi, kbusch, lengchao, axboe, ming.lei, paulmck
Currently nvme_stop_queues quiesces all queues one by one. Every queue
must wait for a grace period (RCU or SRCU). If the controller has a
large number of namespaces, the total waiting time is very long.
Test result: the total waiting time is more than 20 seconds when the
controller has 256 namespaces.
This set improves the quiesce time when using a large set of namespaces,
which also improves I/O failover time in a multipath environment.
We improve both non-blocking and blocking tagsets by introducing
blk_mq_[un]quiesce_tagset, which works on all request queues over a
given tagset in parallel (which is the case in nvme).
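The parallel idea can be modeled in userspace. The toy Python sketch below is not kernel code: `wait_grace_period` is a hypothetical stand-in for one synchronize_srcu call, and the timings are illustrative only. It shows why starting all grace periods first and then waiting for them together (as start_poll_synchronize_srcu + poll_state_synchronize_srcu allow) beats waiting for one grace period per queue serially.

```python
import threading
import time

GRACE_PERIOD = 0.05  # hypothetical per-queue grace period, in seconds
NUM_QUEUES = 8       # stand-in for the number of namespace queues

def wait_grace_period():
    # Stand-in for synchronize_srcu(): each call blocks for one grace period.
    time.sleep(GRACE_PERIOD)

def quiesce_serial(n):
    # Old behaviour: queues are quiesced one by one, so the grace-period
    # waits add up linearly with the number of queues.
    start = time.monotonic()
    for _ in range(n):
        wait_grace_period()
    return time.monotonic() - start

def quiesce_parallel(n):
    # New behaviour: all grace periods are started first and then waited
    # on together, so the total wait is roughly one grace period.
    start = time.monotonic()
    threads = [threading.Thread(target=wait_grace_period) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

serial = quiesce_serial(NUM_QUEUES)
parallel = quiesce_parallel(NUM_QUEUES)
print(f"serial: {serial:.2f}s, parallel: {parallel:.2f}s")
```

With 8 queues the serial variant waits roughly 8 grace periods while the parallel variant waits roughly one, which is the effect the patch set achieves for 256 namespaces.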
Changes from v2:
- replace call_srcu with start_poll_synchronize_srcu
- rename the flag to make it accurate
- move queue unquiescing out of nvme_set_queue_dying
- add a mutex to ensure that all queues are quiesced after
  NVME_CTRL_STOPPED is set
Changes from v1:
- improvement is based on tagset rather than namespaces
Chao Leng (2):
blk-mq: add tagset quiesce interface
nvme: use blk_mq_[un]quiesce_tagset
block/blk-mq.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/core.c | 57 +++++++++++++++---------------------
drivers/nvme/host/nvme.h | 3 +-
include/linux/blk-mq.h | 2 ++
include/linux/blkdev.h | 3 ++
5 files changed, 106 insertions(+), 35 deletions(-)
--
2.16.4
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/2] blk-mq: add tagset quiesce interface
  2022-10-20  3:53 [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Chao Leng
@ 2022-10-20  3:53 ` Chao Leng
  2022-10-20 16:11   ` Paul E. McKenney
  2022-10-20  3:53 ` [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset Chao Leng
  2022-10-20  6:34 ` [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
  2 siblings, 1 reply; 12+ messages in thread
From: Chao Leng @ 2022-10-20  3:53 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, sagi, kbusch, lengchao, axboe, ming.lei, paulmck

Drivers that have shared tagsets may need to quiesce potentially a lot
of request queues that all share a single tagset (e.g. nvme). Add an
interface to quiesce all the queues on a given tagset. This interface is
useful because it can speed up the quiesce by doing it in parallel.

For tagsets that have BLK_MQ_F_BLOCKING set, we first call
start_poll_synchronize_srcu for all queues of the tagset, and then call
poll_state_synchronize_srcu such that all of them wait for the same SRCU
elapsed period. For tagsets that don't have BLK_MQ_F_BLOCKING set, a
single synchronize_rcu is sufficient.

Some queues (e.g. the nvme connect_q) should not be quiesced when
quiescing the tagset, so introduce QUEUE_FLAG_SKIP_TAGSET_QUIESCE to let
the tagset quiesce interface skip such queues.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 block/blk-mq.c         | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  3 ++
 3 files changed, 81 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8070b6c10e8d..f064ecda425b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -311,6 +311,82 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
+static void blk_mq_quiesce_blocking_tagset(struct blk_mq_tag_set *set)
+{
+	int i, count = 0;
+	struct request_queue *q;
+	unsigned long *rcu;
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		if (blk_queue_skip_tagset_quiesce(q))
+			continue;
+
+		blk_mq_quiesce_queue_nowait(q);
+		count++;
+	}
+
+	rcu = kvmalloc(count * sizeof(*rcu), GFP_KERNEL);
+	if (rcu) {
+		i = 0;
+		list_for_each_entry(q, &set->tag_list, tag_set_list) {
+			if (blk_queue_skip_tagset_quiesce(q))
+				continue;
+
+			rcu[i++] = start_poll_synchronize_srcu(q->srcu);
+		}
+
+		i = 0;
+		list_for_each_entry(q, &set->tag_list, tag_set_list) {
+			if (blk_queue_skip_tagset_quiesce(q))
+				continue;
+
+			if (!poll_state_synchronize_srcu(q->srcu, rcu[i++]))
+				synchronize_srcu(q->srcu);
+		}
+
+		kvfree(rcu);
+	} else {
+		list_for_each_entry(q, &set->tag_list, tag_set_list)
+			synchronize_srcu(q->srcu);
+	}
+}
+
+static void blk_mq_quiesce_nonblocking_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		if (blk_queue_skip_tagset_quiesce(q))
+			continue;
+
+		blk_mq_quiesce_queue_nowait(q);
+	}
+	synchronize_rcu();
+}
+
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
+{
+	mutex_lock(&set->tag_list_lock);
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		blk_mq_quiesce_blocking_tagset(set);
+	else
+		blk_mq_quiesce_nonblocking_tagset(set);
+
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
+
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_unquiesce_queue(q);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ba18e9bdb799..1df47606d0a7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -877,6 +877,8 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set);
 void blk_mq_wait_quiesce_done(struct request_queue *q);
 void blk_mq_unquiesce_queue(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50e358a19d98..efa3fa771dce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -579,6 +579,7 @@ struct request_queue {
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT	29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED	30	/* single queue style io dispatch */
+#define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -619,6 +620,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
+#define blk_queue_skip_tagset_quiesce(q) \
+	test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.16.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH v3 1/2] blk-mq: add tagset quiesce interface
  2022-10-20  3:53 ` [PATCH v3 1/2] blk-mq: add tagset quiesce interface Chao Leng
@ 2022-10-20 16:11   ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2022-10-20 16:11 UTC (permalink / raw)
  To: Chao Leng; +Cc: linux-nvme, linux-block, hch, sagi, kbusch, axboe, ming.lei

On Thu, Oct 20, 2022 at 11:53:47AM +0800, Chao Leng wrote:
> Drivers that have shared tagsets may need to quiesce potentially a lot
> of request queues that all share a single tagset (e.g. nvme). Add an
> interface to quiesce all the queues on a given tagset. This interface is
> useful because it can speedup the quiesce by doing it in parallel.
> 
> For tagsets that have BLK_MQ_F_BLOCKING set, we first call
> start_poll_synchronize_srcu for all queues of the tagset, and then call
> poll_state_synchronize_srcu such that all of them wait for the same srcu
> elapsed period. For tagsets that don't have BLK_MQ_F_BLOCKING set,
> we simply call a single synchronize_rcu as this is sufficient.
> 
> Because some queues should not need to be quiesced(e.g. nvme connect_q)
> when quiesce the tagset. So introduce QUEUE_FLAG_SKIP_TAGSET_QUIESCE to
> tagset quiesce interface to skip the queue.
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Chao Leng <lengchao@huawei.com>

From an RCU viewpoint:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

As noted in the earlier email thread, using a single srcu_struct for the
whole tag list would require even less memory and would further reduce
the overhead and latency of blk_mq_quiesce_blocking_tagset().
							Thanx, Paul

> ---
>  block/blk-mq.c         | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-mq.h |  2 ++
>  include/linux/blkdev.h |  3 ++
>  3 files changed, 81 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8070b6c10e8d..f064ecda425b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -311,6 +311,82 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
>  
> +static void blk_mq_quiesce_blocking_tagset(struct blk_mq_tag_set *set)
> +{
> +	int i, count = 0;
> +	struct request_queue *q;
> +	unsigned long *rcu;
> +
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +		if (blk_queue_skip_tagset_quiesce(q))
> +			continue;
> +
> +		blk_mq_quiesce_queue_nowait(q);
> +		count++;
> +	}
> +
> +	rcu = kvmalloc(count * sizeof(*rcu), GFP_KERNEL);
> +	if (rcu) {
> +		i = 0;
> +		list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +			if (blk_queue_skip_tagset_quiesce(q))
> +				continue;
> +
> +			rcu[i++] = start_poll_synchronize_srcu(q->srcu);
> +		}
> +
> +		i = 0;
> +		list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +			if (blk_queue_skip_tagset_quiesce(q))
> +				continue;
> +
> +			if (!poll_state_synchronize_srcu(q->srcu, rcu[i++]))
> +				synchronize_srcu(q->srcu);
> +		}
> +
> +		kvfree(rcu);
> +	} else {
> +		list_for_each_entry(q, &set->tag_list, tag_set_list)
> +			synchronize_srcu(q->srcu);
> +	}
> +}
> +
> +static void blk_mq_quiesce_nonblocking_tagset(struct blk_mq_tag_set *set)
> +{
> +	struct request_queue *q;
> +
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +		if (blk_queue_skip_tagset_quiesce(q))
> +			continue;
> +
> +		blk_mq_quiesce_queue_nowait(q);
> +	}
> +	synchronize_rcu();
> +}
> +
> +void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> +{
> +	mutex_lock(&set->tag_list_lock);
> +	if (set->flags & BLK_MQ_F_BLOCKING)
> +		blk_mq_quiesce_blocking_tagset(set);
> +	else
> +		blk_mq_quiesce_nonblocking_tagset(set);
> +
> +	mutex_unlock(&set->tag_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
> +
> +void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> +{
> +	struct request_queue *q;
> +
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		blk_mq_unquiesce_queue(q);
> +	mutex_unlock(&set->tag_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> +
>  void blk_mq_wake_waiters(struct request_queue *q)
>  {
>  	struct blk_mq_hw_ctx *hctx;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index ba18e9bdb799..1df47606d0a7 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -877,6 +877,8 @@ void blk_mq_start_hw_queues(struct request_queue *q);
>  void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
>  void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
>  void blk_mq_quiesce_queue(struct request_queue *q);
> +void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
> +void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set);
>  void blk_mq_wait_quiesce_done(struct request_queue *q);
>  void blk_mq_unquiesce_queue(struct request_queue *q);
>  void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 50e358a19d98..efa3fa771dce 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -579,6 +579,7 @@ struct request_queue {
>  #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
>  #define QUEUE_FLAG_NOWAIT	29	/* device supports NOWAIT */
>  #define QUEUE_FLAG_SQ_SCHED	30	/* single queue style io dispatch */
> +#define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
>  
>  #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
>  				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
> @@ -619,6 +620,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
>  #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
>  #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
> +#define blk_queue_skip_tagset_quiesce(q) \
> +	test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)
>  
>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset
  2022-10-20  3:53 [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Chao Leng
  2022-10-20  3:53 ` [PATCH v3 1/2] blk-mq: add tagset quiesce interface Chao Leng
@ 2022-10-20  3:53 ` Chao Leng
  2022-10-20  6:11   ` Sagi Grimberg
  2022-10-20  6:34 ` [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
  2 siblings, 1 reply; 12+ messages in thread
From: Chao Leng @ 2022-10-20  3:53 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, sagi, kbusch, lengchao, axboe, ming.lei, paulmck

All controller namespaces share the same tagset, so we can use this
interface, which does the optimal operation for parallel quiesce based
on the tagset type (e.g. blocking and non-blocking tagsets).

The nvme connect_q should not be quiesced when quiescing the tagset, so
set QUEUE_FLAG_SKIP_TAGSET_QUIESCE on it when initializing connect_q.

Currently we use NVME_NS_STOPPED to ensure pairing of quiescing and
unquiescing. With blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED becomes
invalid, so introduce NVME_CTRL_STOPPED to replace it. In addition, we
never really quiesce a single namespace, so it is a better choice to
move the flag from the namespace to the controller.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 57 +++++++++++++++++++-----------------------------
 drivers/nvme/host/nvme.h |  3 ++-
 2 files changed, 25 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 059737c1a2c1..c7727d1f228e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4890,6 +4890,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 			ret = PTR_ERR(ctrl->connect_q);
 			goto out_free_tag_set;
 		}
+		blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, ctrl->connect_q);
 	}
 
 	ctrl->tagset = set;
@@ -5013,6 +5014,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
+	mutex_init(&ctrl->queue_state_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
 	xa_init(&ctrl->cels);
 	init_rwsem(&ctrl->namespaces_rwsem);
@@ -5089,36 +5091,21 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(nvme_init_ctrl);
 
-static void nvme_start_ns_queue(struct nvme_ns *ns)
-{
-	if (test_and_clear_bit(NVME_NS_STOPPED, &ns->flags))
-		blk_mq_unquiesce_queue(ns->queue);
-}
-
-static void nvme_stop_ns_queue(struct nvme_ns *ns)
-{
-	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
-		blk_mq_quiesce_queue(ns->queue);
-	else
-		blk_mq_wait_quiesce_done(ns->queue);
-}
-
 /*
  * Prepare a queue for teardown.
  *
- * This must forcibly unquiesce queues to avoid blocking dispatch, and only set
- * the capacity to 0 after that to avoid blocking dispatchers that may be
- * holding bd_butex. This will end buffered writers dirtying pages that can't
- * be synced.
+ * The caller should unquiesce the queue to avoid blocking dispatch.
 */
 static void nvme_set_queue_dying(struct nvme_ns *ns)
 {
 	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
 		return;
 
+	/*
+	 * Mark the disk dead to prevent new opens, and set the capacity to 0
+	 * to end buffered writers dirtying pages that can't be synced.
+	 */
 	blk_mark_disk_dead(ns->disk);
-	nvme_start_ns_queue(ns);
-
 	set_capacity_and_notify(ns->disk, 0);
 }
 
@@ -5134,15 +5121,17 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		nvme_set_queue_dying(ns);
+
+	up_read(&ctrl->namespaces_rwsem);
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
 		nvme_start_admin_queue(ctrl);
 
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_set_queue_dying(ns);
-
-	up_read(&ctrl->namespaces_rwsem);
+	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+		nvme_start_queues(ctrl);
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
 
@@ -5196,23 +5185,23 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	mutex_lock(&ctrl->queue_state_lock);
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_stop_ns_queue(ns);
-	up_read(&ctrl->namespaces_rwsem);
+	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+		blk_mq_quiesce_tagset(ctrl->tagset);
+
+	mutex_unlock(&ctrl->queue_state_lock);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	mutex_lock(&ctrl->queue_state_lock);
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_start_ns_queue(ns);
-	up_read(&ctrl->namespaces_rwsem);
+	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+		blk_mq_unquiesce_tagset(ctrl->tagset);
+
+	mutex_unlock(&ctrl->queue_state_lock);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a29877217ee6..23be055fb425 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -237,6 +237,7 @@ enum nvme_ctrl_flags {
 	NVME_CTRL_FAILFAST_EXPIRED	= 0,
 	NVME_CTRL_ADMIN_Q_STOPPED	= 1,
 	NVME_CTRL_STARTED_ONCE		= 2,
+	NVME_CTRL_STOPPED		= 3,
 };
 
 struct nvme_ctrl {
@@ -245,6 +246,7 @@ struct nvme_ctrl {
 	bool identified;
 	spinlock_t lock;
 	struct mutex scan_lock;
+	struct mutex queue_state_lock;
 	const struct nvme_ctrl_ops *ops;
 	struct request_queue *admin_q;
 	struct request_queue *connect_q;
@@ -487,7 +489,6 @@ struct nvme_ns {
 #define NVME_NS_ANA_PENDING	2
 #define NVME_NS_FORCE_RO	3
 #define NVME_NS_READY		4
-#define NVME_NS_STOPPED		5
 
 	struct cdev cdev;
 	struct device cdev_device;
-- 
2.16.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset
  2022-10-20  3:53 ` [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset Chao Leng
@ 2022-10-20  6:11   ` Sagi Grimberg
  2022-10-20  9:47     ` Chao Leng
  0 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2022-10-20  6:11 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, linux-block; +Cc: hch, kbusch, axboe, ming.lei, paulmck

On 10/20/22 06:53, Chao Leng wrote:
> All controller namespaces share the same tagset, so we can use this
> interface which does the optimal operation for parallel quiesce based on
> the tagset type(e.g. blocking tagsets and non-blocking tagsets).
> 
> nvme connect_q should not be quiesced when quiesce tagset, so set the
> QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.
> 
> Currntely we use NVME_NS_STOPPED to ensure pairing quiescing and
> unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
> invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
> In addition, we never really quiesce a single namespace. It is a better
> choice to move the flag from ns to ctrl.
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Chao Leng <lengchao@huawei.com>
> ---
>  drivers/nvme/host/core.c | 57 +++++++++++++++++++-----------------------------
>  drivers/nvme/host/nvme.h |  3 ++-
>  2 files changed, 25 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 059737c1a2c1..c7727d1f228e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4890,6 +4890,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
>  			ret = PTR_ERR(ctrl->connect_q);
>  			goto out_free_tag_set;
>  		}
> +		blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, ctrl->connect_q);
>  	}
> 
>  	ctrl->tagset = set;
> @@ -5013,6 +5014,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>  	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>  	spin_lock_init(&ctrl->lock);
>  	mutex_init(&ctrl->scan_lock);
> +	mutex_init(&ctrl->queue_state_lock);

Why is this lock needed?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset
  2022-10-20  6:11   ` Sagi Grimberg
@ 2022-10-20  9:47     ` Chao Leng
  0 siblings, 0 replies; 12+ messages in thread
From: Chao Leng @ 2022-10-20  9:47 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, linux-block
  Cc: hch, kbusch, axboe, ming.lei, paulmck

On 2022/10/20 14:11, Sagi Grimberg wrote:
> 
> 
> On 10/20/22 06:53, Chao Leng wrote:
>> All controller namespaces share the same tagset, so we can use this
>> interface which does the optimal operation for parallel quiesce based on
>> the tagset type(e.g. blocking tagsets and non-blocking tagsets).
>>
>> nvme connect_q should not be quiesced when quiesce tagset, so set the
>> QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.
>>
>> Currntely we use NVME_NS_STOPPED to ensure pairing quiescing and
>> unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
>> invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
>> In addition, we never really quiesce a single namespace. It is a better
>> choice to move the flag from ns to ctrl.
>>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> Signed-off-by: Chao Leng <lengchao@huawei.com>
>> ---
>>  drivers/nvme/host/core.c | 57 +++++++++++++++++++-----------------------------
>>  drivers/nvme/host/nvme.h |  3 ++-
>>  2 files changed, 25 insertions(+), 35 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 059737c1a2c1..c7727d1f228e 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4890,6 +4890,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
>>  			ret = PTR_ERR(ctrl->connect_q);
>>  			goto out_free_tag_set;
>>  		}
>> +		blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, ctrl->connect_q);
>>  	}
>>
>>  	ctrl->tagset = set;
>> @@ -5013,6 +5014,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>>  	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>>  	spin_lock_init(&ctrl->lock);
>>  	mutex_init(&ctrl->scan_lock);
>> +	mutex_init(&ctrl->queue_state_lock);
> 
> Why is this lock needed?
It is used to protect code paths that require the queues to actually be
quiesced.

The scenario (without the lock):
Thread A: calls nvme_stop_queues, sets NVME_CTRL_STOPPED, then quiesces
the tagset and waits for the grace period.
Thread B: calls nvme_stop_queues; because NVME_CTRL_STOPPED is already
set, it continues to do work that requires the queues to be quiesced.
Because the grace period of the queues has not ended yet, this may cause
problems.
Thread A: the grace period ends, and it continues.

So add a lock to ensure that all queues are quiesced after
NVME_CTRL_STOPPED is set.

The old code handled this by forcing a wait for the grace period:

	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
		blk_mq_quiesce_queue(ns->queue);
	else
		blk_mq_wait_quiesce_done(ns->queue);

The old code was not absolutely safe either. For example:
Thread A: test_and_set_bit, then is interrupted by a hardware irq and
loses its chance to run.
Thread B: test_and_set_bit, waits for the grace period, and then
continues to do work that requires the queue to be quiesced; because
the queue is not actually quiesced yet, this may cause problems.
Thread A: gets the chance to run, calls blk_mq_quiesce_queue, and then
continues.

> 
> .

^ permalink raw reply	[flat|nested] 12+ messages in thread
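The flag/lock pairing described above can be sketched as a toy userspace model. This is illustrative Python, not kernel code: `quiesce_count` is a hypothetical stand-in for the actual blk_mq_[un]quiesce_tagset calls, and holding the mutex across the test-and-set plus the quiesce models the guarantee that any caller returning from stop_queues sees a fully quiesced tagset.

```python
import threading

class Ctrl:
    """Toy model of NVME_CTRL_STOPPED + queue_state_lock pairing."""

    def __init__(self):
        self.stopped = False                       # models the NVME_CTRL_STOPPED bit
        self.queue_state_lock = threading.Lock()   # models ctrl->queue_state_lock
        self.quiesce_count = 0                     # how many times we actually quiesced

    def stop_queues(self):
        # The whole test-and-set *and* the quiesce happen under the lock, so a
        # second caller cannot return while the first is still mid-quiesce.
        with self.queue_state_lock:
            if not self.stopped:
                self.stopped = True
                self.quiesce_count += 1            # stand-in for blk_mq_quiesce_tagset()

    def start_queues(self):
        with self.queue_state_lock:
            if self.stopped:
                self.stopped = False
                self.quiesce_count -= 1            # stand-in for blk_mq_unquiesce_tagset()

ctrl = Ctrl()
ctrl.stop_queues()
ctrl.stop_queues()   # second stop is a no-op: the tagset is quiesced exactly once
ctrl.start_queues()
ctrl.start_queues()  # second start is a no-op as well
```

The design choice being modeled: a bare test_and_set_bit only makes the operations idempotent, while the surrounding mutex additionally makes "returned from stop_queues" imply "queues are quiesced".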
* Re: [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
  2022-10-20  3:53 [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Chao Leng
  2022-10-20  3:53 ` [PATCH v3 1/2] blk-mq: add tagset quiesce interface Chao Leng
  2022-10-20  3:53 ` [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset Chao Leng
@ 2022-10-20  6:34 ` Christoph Hellwig
  2022-10-20  9:36   ` Chao Leng
  2 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2022-10-20  6:34 UTC (permalink / raw)
  To: Chao Leng
  Cc: linux-nvme, linux-block, hch, sagi, kbusch, axboe, ming.lei, paulmck

I thought the conclusion from last round was to move the srcu_struct to
the tagset?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
  2022-10-20  6:34 ` [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
@ 2022-10-20  9:36   ` Chao Leng
  0 siblings, 0 replies; 12+ messages in thread
From: Chao Leng @ 2022-10-20  9:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, sagi, kbusch, axboe, ming.lei, paulmck

On 2022/10/20 14:34, Christoph Hellwig wrote:
> I though the conclusion from last round was to move the srcu_struct to
> the tagset?
Moving the srcu_struct to the tagset may lose flexibility. I am not sure
it is good for the other modules that currently use
blk_mq_quiesce_queue. I am also not sure whether there will be future
scenarios where blk_mq_quiesce_queue has to be used, and whether it
would suit them.

Allocating a temporary array for SRCU is an acceptable cost: the maximum
memory size is a few hundred KB, and most of the time it is less than a
few KB. So I did not move the srcu_struct to the tagset in patch v3.

Moving the srcu_struct to the tag_set sounds like a good idea, but we
may need more time to analyze it. I suggest doing that optimization in a
separate patch set.
> 
> .
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
@ 2022-10-13  9:44 Chao Leng
  2022-10-13 14:32 ` Chaitanya Kulkarni
  0 siblings, 1 reply; 12+ messages in thread
From: Chao Leng @ 2022-10-13  9:44 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: hch, sagi, kbusch, lengchao, ming.lei, axboe

This set improves the quiesce time when using a large set of namespaces,
which also improves I/O failover time in a multipath environment.

We improve both non-blocking and blocking hctxs by introducing
blk_mq_[un]quiesce_tagset, which works on all request queues over a
given tagset in parallel (which is the case in nvme) for both tagset
types (blocking and non-blocking).

Changes from v1:
- improvement is based on tagset rather than namespaces

Chao Leng (2):
  blk-mq: add tagset quiesce interface
  nvme: use blk_mq_[un]quiesce_tagset

 block/blk-mq.c           | 75 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/core.c | 42 +++++++++------------------
 drivers/nvme/host/nvme.h |  2 +-
 include/linux/blk-mq.h   |  2 ++
 include/linux/blkdev.h   |  2 ++
 5 files changed, 93 insertions(+), 30 deletions(-)

-- 
2.16.4

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
  2022-10-13  9:44 Chao Leng
@ 2022-10-13 14:32 ` Chaitanya Kulkarni
  2022-10-14  2:12   ` Chao Leng
  0 siblings, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-13 14:32 UTC (permalink / raw)
  To: Chao Leng, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
  Cc: hch@lst.de, sagi@grimberg.me, kbusch@kernel.org, ming.lei@redhat.com,
	axboe@kernel.dk

On 10/13/22 02:44, Chao Leng wrote:
> This set improves the quiesce time when using a large set of
> namespaces, which also improves I/O failover time in a multipath environment.
> 

By how much, and what problems exist with the current timing, needs to
be documented, i.e. add quantitative data if you are posting patches for
a time/performance improvement.

-ck

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
  2022-10-13 14:32 ` Chaitanya Kulkarni
@ 2022-10-14  2:12   ` Chao Leng
  2022-10-15  0:30     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 12+ messages in thread
From: Chao Leng @ 2022-10-14  2:12 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org
  Cc: hch@lst.de, sagi@grimberg.me, kbusch@kernel.org, ming.lei@redhat.com,
	axboe@kernel.dk

On 2022/10/13 22:32, Chaitanya Kulkarni wrote:
> On 10/13/22 02:44, Chao Leng wrote:
>> This set improves the quiesce time when using a large set of
>> namespaces, which also improves I/O failover time in a multipath environment.
>> 
> 
> by how much and what are the problems exits with current timing needs to
> be documented, i.e. add quantitative data if you are posting patches for
> time/performance improvement.
Theoretically, every synchronize_rcu/synchronize_srcu needs to wait more
than 10 ms. Currently nvme quiesces all namespace queues one by one: the
more namespaces, the longer the total waiting time.

The test result:
Test scenario: nvme over RoCE with multipathing, 256 namespaces.
When sending I/Os to all namespaces and simulating a path fault, the
failover waiting time is more than 20 seconds. After analysis, it was
found that the total quiesce waiting time for 256 namespaces is more
than 20 seconds.
After optimization, in the same scenario the failover waiting time is
less than 1 second.
> 
> -ck
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread
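Chao's reported numbers are consistent with simple back-of-the-envelope arithmetic. In the sketch below, the per-grace-period cost is an assumed round figure chosen to match the report, not a measured kernel value; real grace-period latency varies with configuration.

```python
# Assumed cost of one RCU/SRCU grace period, in milliseconds. This is a
# hypothetical figure picked to match the reported ~20 s for 256 namespaces;
# actual values depend on kernel config and load.
grace_period_ms = 80

namespaces = 256

# Old behaviour: each namespace queue waits for its own grace period, serially.
serial_ms = namespaces * grace_period_ms

# New behaviour: all queues share (roughly) a single grace-period wait.
parallel_ms = grace_period_ms

print(f"serial:   {serial_ms / 1000.0:.2f} s")
print(f"parallel: {parallel_ms / 1000.0:.2f} s")
```

With these assumptions the serial path costs about 20.5 seconds, matching the reported failover time, while the parallel path stays well under one second.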
* Re: [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces
  2022-10-14  2:12   ` Chao Leng
@ 2022-10-15  0:30     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-15  0:30 UTC (permalink / raw)
  To: Chao Leng
  Cc: hch@lst.de, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, sagi@grimberg.me, kbusch@kernel.org,
	ming.lei@redhat.com, axboe@kernel.dk

> The test result:
> Test Scenario: nvme over roce with multipathing, 256 namespaces
> When send I/Os to all namespaces, simulate a path fault, the fail over
> waiting time is more than 20 seconds.
> After analysis, it was found that the total of quiesce waiting time for
> 256 namespace is more than 20 seconds.
> After optimization, same scenario the fail over waiting time is less
> than 1 second.

I'm not questioning what you are doing; please add these numbers to the
cover letter or patches for reference.

-ck

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-10-20 16:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-20  3:53 [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Chao Leng
2022-10-20  3:53 ` [PATCH v3 1/2] blk-mq: add tagset quiesce interface Chao Leng
2022-10-20 16:11   ` Paul E. McKenney
2022-10-20  3:53 ` [PATCH v3 2/2] nvme: use blk_mq_[un]quiesce_tagset Chao Leng
2022-10-20  6:11   ` Sagi Grimberg
2022-10-20  9:47     ` Chao Leng
2022-10-20  6:34 ` [PATCH v2 0/2] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
2022-10-20  9:36   ` Chao Leng
  -- strict thread matches above, loose matches on Subject: below --
2022-10-13  9:44 Chao Leng
2022-10-13 14:32 ` Chaitanya Kulkarni
2022-10-14  2:12   ` Chao Leng
2022-10-15  0:30     ` Chaitanya Kulkarni