* [PATCH 0/3] NVMe multipath path selector enhancements
@ 2023-09-25 16:31 Ewan D. Milne
2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw)
To: linux-nvme; +Cc: tsong, jmeneghi, mlombard
This patch series adds a "queue-depth" iopolicy to the NVMe multipath
path selector code. It also adds the capability to optionally allow
the use of ANA nonoptimized paths as well as optimized paths, this can
be beneficial in some cases and helps to provide additional paths to
verify the desired target port utilization.
Signend-off-by: Ewan D. Milne <emilne@redhat.com>
Ewan D. Milne (3):
block: introduce blk_queue_nr_active()
Implemented new iopolicy "queue-depth"
nvme-multipath: add "use_nonoptimized" module option
block/blk-mq.h | 5 ---
drivers/nvme/host/multipath.c | 70 ++++++++++++++++++++++++++++++++---
drivers/nvme/host/nvme.h | 1 +
include/linux/blk-mq.h | 33 +++++++++++++----
4 files changed, 92 insertions(+), 17 deletions(-)
--
2.20.1
^ permalink raw reply [flat|nested] 24+ messages in thread* [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne @ 2023-09-25 16:31 ` Ewan D. Milne 2023-09-25 20:56 ` Bart Van Assche ` (3 more replies) 2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne 2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne 2 siblings, 4 replies; 24+ messages in thread From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw) To: linux-nvme; +Cc: tsong, jmeneghi, mlombard Returns a count of the total number of active requests in a queue. For non-shared tags (the usual case) this is the sum of nr_active from all of the hctxs. Signed-off-by: Ewan D. Milne <emilne@redhat.com> --- block/blk-mq.h | 5 ----- include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++------- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/block/blk-mq.h b/block/blk-mq.h index 1743857e0b01..fbc65eefa017 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, return tag < tags->nr_reserved_tags; } -static inline bool blk_mq_is_shared_tags(unsigned int flags) -{ - return flags & BLK_MQ_F_TAG_HCTX_SHARED; -} - static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) { if (data->rq_flags & RQF_SCHED_TAGS) diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 01e8c31db665..c921ae5236ab 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, bool blk_mq_queue_inflight(struct request_queue *q); +#define queue_for_each_hw_ctx(q, hctx, i) \ + xa_for_each(&(q)->hctx_table, (i), (hctx)) + +#define hctx_for_each_ctx(hctx, ctx, i) \ + for ((i) = 0; (i) < (hctx)->nr_ctx && \ + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) + +static inline bool blk_mq_is_shared_tags(unsigned int flags) +{ + return flags & BLK_MQ_F_TAG_HCTX_SHARED; +} + +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) +{ + unsigned int nr_active = 0; + struct blk_mq_hw_ctx *hctx; + unsigned long i; + + queue_for_each_hw_ctx(q, hctx, i) { + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) + return atomic_read(&q->nr_active_requests_shared_tags); + nr_active += atomic_read(&hctx->nr_active); + } + return nr_active; +} + enum { /* return when out of requests */ BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0), @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq) return rq + 1; } -#define queue_for_each_hw_ctx(q, hctx, i) \ - xa_for_each(&(q)->hctx_table, (i), (hctx)) - -#define hctx_for_each_ctx(hctx, ctx, i) \ - for ((i) = 0; (i) < (hctx)->nr_ctx && \ - ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) - static inline void blk_mq_cleanup_rq(struct request *rq) { if (rq->q->mq_ops->cleanup_rq) -- 2.20.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne @ 2023-09-25 20:56 ` Bart Van Assche 2023-09-27 14:49 ` Ewan Milne 2023-09-27 7:36 ` Hannes Reinecke ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Bart Van Assche @ 2023-09-25 20:56 UTC (permalink / raw) To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard On 9/25/23 09:31, Ewan D. Milne wrote: > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > +{ > + unsigned int nr_active = 0; > + struct blk_mq_hw_ctx *hctx; > + unsigned long i; > + > + queue_for_each_hw_ctx(q, hctx, i) { > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > + return atomic_read(&q->nr_active_requests_shared_tags); > + nr_active += atomic_read(&hctx->nr_active); > + } > + return nr_active; > +} The above function should never be called from any command submission code path. Hence, I think it should be added in a .c file instead of include/linux/blk-mq.h. Thanks, Bart. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-25 20:56 ` Bart Van Assche @ 2023-09-27 14:49 ` Ewan Milne 2023-09-27 14:58 ` Bart Van Assche 0 siblings, 1 reply; 24+ messages in thread From: Ewan Milne @ 2023-09-27 14:49 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-nvme, tsong, jmeneghi, mlombard The function is only called in one place by the queue-depth nvme path selector in a later patch in the series, so it is called in the submission path, which is why I made it inline. Or did you mean that the function is too expensive? See other discussions. -Ewan On Mon, Sep 25, 2023 at 4:56 PM Bart Van Assche <bvanassche@acm.org> wrote: > > On 9/25/23 09:31, Ewan D. Milne wrote: > > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > > +{ > > + unsigned int nr_active = 0; > > + struct blk_mq_hw_ctx *hctx; > > + unsigned long i; > > + > > + queue_for_each_hw_ctx(q, hctx, i) { > > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > > + return atomic_read(&q->nr_active_requests_shared_tags); > > + nr_active += atomic_read(&hctx->nr_active); > > + } > > + return nr_active; > > +} > > The above function should never be called from any command submission > code path. Hence, I think it should be added in a .c file instead of > include/linux/blk-mq.h. > > Thanks, > > Bart. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 14:49 ` Ewan Milne @ 2023-09-27 14:58 ` Bart Van Assche 0 siblings, 0 replies; 24+ messages in thread From: Bart Van Assche @ 2023-09-27 14:58 UTC (permalink / raw) To: Ewan Milne; +Cc: linux-nvme, tsong, jmeneghi, mlombard On 9/27/23 07:49, Ewan Milne wrote: > On Mon, Sep 25, 2023 at 4:56 PM Bart Van Assche <bvanassche@acm.org> wrote: >> On 9/25/23 09:31, Ewan D. Milne wrote: >>> +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) >>> +{ >>> + unsigned int nr_active = 0; >>> + struct blk_mq_hw_ctx *hctx; >>> + unsigned long i; >>> + >>> + queue_for_each_hw_ctx(q, hctx, i) { >>> + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) >>> + return atomic_read(&q->nr_active_requests_shared_tags); >>> + nr_active += atomic_read(&hctx->nr_active); >>> + } >>> + return nr_active; >>> +} >> >> The above function should never be called from any command submission >> code path. Hence, I think it should be added in a .c file instead of >> include/linux/blk-mq.h. > > The function is only called in one place by the queue-depth nvme path > selector in a > later patch in the series, so it is called in the submission path, > which is why I made it inline. Hi Ewan, In order to keep include/linux/blk-mq.h readable and in order to minimize the kernel compilation time, nontrivial functions that are not called from the hot path should occur in a .c file instead of in a .h file. Thanks, Bart. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne 2023-09-25 20:56 ` Bart Van Assche @ 2023-09-27 7:36 ` Hannes Reinecke 2023-09-27 11:37 ` Ming Lei 2023-09-27 13:34 ` Ewan Milne 2023-09-27 9:49 ` Ming Lei 2023-09-27 10:56 ` Sagi Grimberg 3 siblings, 2 replies; 24+ messages in thread From: Hannes Reinecke @ 2023-09-27 7:36 UTC (permalink / raw) To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard On 9/25/23 18:31, Ewan D. Milne wrote: > Returns a count of the total number of active requests > in a queue. For non-shared tags (the usual case) this is > the sum of nr_active from all of the hctxs. > Bit of an exaggeration here. Shared tags are in use if the hardware supports only a global tag space (ie basically all SCSI and FC HBAs). > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > --- > block/blk-mq.h | 5 ----- > include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++------- > 2 files changed, 26 insertions(+), 12 deletions(-) > > diff --git a/block/blk-mq.h b/block/blk-mq.h > index 1743857e0b01..fbc65eefa017 100644 > --- a/block/blk-mq.h > +++ b/block/blk-mq.h > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, > return tag < tags->nr_reserved_tags; > } > > -static inline bool blk_mq_is_shared_tags(unsigned int flags) > -{ > - return flags & BLK_MQ_F_TAG_HCTX_SHARED; > -} > - > static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) > { > if (data->rq_flags & RQF_SCHED_TAGS) > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index 01e8c31db665..c921ae5236ab 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, > > bool blk_mq_queue_inflight(struct request_queue *q); > > +#define queue_for_each_hw_ctx(q, hctx, i) \ > + xa_for_each(&(q)->hctx_table, (i), (hctx)) > + > +#define hctx_for_each_ctx(hctx, ctx, i) \ > + for ((i) = 0; (i) < (hctx)->nr_ctx && \ > + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > + > +static inline bool blk_mq_is_shared_tags(unsigned int flags) > +{ > + return flags & BLK_MQ_F_TAG_HCTX_SHARED; > +} > + > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > +{ > + unsigned int nr_active = 0; > + struct blk_mq_hw_ctx *hctx; > + unsigned long i; > + > + queue_for_each_hw_ctx(q, hctx, i) { > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > + return atomic_read(&q->nr_active_requests_shared_tags); > + nr_active += atomic_read(&hctx->nr_active); > + } > + return nr_active; > +} > + > enum { > /* return when out of requests */ > BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0), > @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq) > return rq + 1; > } > > -#define queue_for_each_hw_ctx(q, hctx, i) \ > - xa_for_each(&(q)->hctx_table, (i), (hctx)) > - > -#define hctx_for_each_ctx(hctx, ctx, i) \ > - for ((i) = 0; (i) < (hctx)->nr_ctx && \ > - ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > - > static inline void blk_mq_cleanup_rq(struct request *rq) > { > if (rq->q->mq_ops->cleanup_rq) Well. As discussed, using xarray on 'small' arrays is horrible for performance. We really should revert the patch from Ming to turn it back into a simple array; that'll make traversing much faster. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 7:36 ` Hannes Reinecke @ 2023-09-27 11:37 ` Ming Lei 2023-09-27 13:34 ` Ewan Milne 1 sibling, 0 replies; 24+ messages in thread From: Ming Lei @ 2023-09-27 11:37 UTC (permalink / raw) To: Hannes Reinecke Cc: Ewan D. Milne, linux-nvme, tsong, jmeneghi, mlombard, ming.lei On Wed, Sep 27, 2023 at 09:36:11AM +0200, Hannes Reinecke wrote: > On 9/25/23 18:31, Ewan D. Milne wrote: > > Returns a count of the total number of active requests > > in a queue. For non-shared tags (the usual case) this is > > the sum of nr_active from all of the hctxs. > > > Bit of an exaggeration here. > Shared tags are in use if the hardware supports only a global tag space > (ie basically all SCSI and FC HBAs). > > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > > --- > > block/blk-mq.h | 5 ----- > > include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++------- > > 2 files changed, 26 insertions(+), 12 deletions(-) > > > > diff --git a/block/blk-mq.h b/block/blk-mq.h > > index 1743857e0b01..fbc65eefa017 100644 > > --- a/block/blk-mq.h > > +++ b/block/blk-mq.h > > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, > > return tag < tags->nr_reserved_tags; > > } > > -static inline bool blk_mq_is_shared_tags(unsigned int flags) > > -{ > > - return flags & BLK_MQ_F_TAG_HCTX_SHARED; > > -} > > - > > static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) > > { > > if (data->rq_flags & RQF_SCHED_TAGS) > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > > index 01e8c31db665..c921ae5236ab 100644 > > --- a/include/linux/blk-mq.h > > +++ b/include/linux/blk-mq.h > > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, > > bool blk_mq_queue_inflight(struct request_queue *q); > > +#define queue_for_each_hw_ctx(q, hctx, i) \ > > + xa_for_each(&(q)->hctx_table, (i), (hctx)) > > + > > +#define hctx_for_each_ctx(hctx, ctx, i) \ > > + for ((i) = 0; (i) < (hctx)->nr_ctx && \ > > + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > > + > > +static inline bool blk_mq_is_shared_tags(unsigned int flags) > > +{ > > + return flags & BLK_MQ_F_TAG_HCTX_SHARED; > > +} > > + > > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > > +{ > > + unsigned int nr_active = 0; > > + struct blk_mq_hw_ctx *hctx; > > + unsigned long i; > > + > > + queue_for_each_hw_ctx(q, hctx, i) { > > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > > + return atomic_read(&q->nr_active_requests_shared_tags); > > + nr_active += atomic_read(&hctx->nr_active); > > + } > > + return nr_active; > > +} > > + > > enum { > > /* return when out of requests */ > > BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0), > > @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq) > > return rq + 1; > > } > > -#define queue_for_each_hw_ctx(q, hctx, i) \ > > - xa_for_each(&(q)->hctx_table, (i), (hctx)) > > - > > -#define hctx_for_each_ctx(hctx, ctx, i) \ > > - for ((i) = 0; (i) < (hctx)->nr_ctx && \ > > - ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > > - > > static inline void blk_mq_cleanup_rq(struct request *rq) > > { > > if (rq->q->mq_ops->cleanup_rq) > > Well. As discussed, using xarray on 'small' arrays is horrible for > performance. We really should revert the patch from Ming to > turn it back into a simple array; that'll make traversing much faster. But queue_for_each_hw_ctx() isn't supposed for fast path, and it is always expensive to run cross-queue things. It also means any way running cross-queue thing in fast path may not be one good idea. Thanks, Ming ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 7:36 ` Hannes Reinecke 2023-09-27 11:37 ` Ming Lei @ 2023-09-27 13:34 ` Ewan Milne 2023-10-03 20:11 ` Uday Shankar 1 sibling, 1 reply; 24+ messages in thread From: Ewan Milne @ 2023-09-27 13:34 UTC (permalink / raw) To: Hannes Reinecke; +Cc: linux-nvme, tsong, jmeneghi, mlombard Maybe I missed something but I only found 2 cases that set BLK_MQ_F_TAG_HCTX_SHARED. For NVMe/TCP, at least on the system I used that has 80 cores, there were only 1 hctxs but I haven't looked into how many e.g. we have for PCI devices. FC I think has a smallish number. -Ewan On Wed, Sep 27, 2023 at 3:36 AM Hannes Reinecke <hare@suse.de> wrote: > > On 9/25/23 18:31, Ewan D. Milne wrote: > > Returns a count of the total number of active requests > > in a queue. For non-shared tags (the usual case) this is > > the sum of nr_active from all of the hctxs. > > > Bit of an exaggeration here. > Shared tags are in use if the hardware supports only a global tag space > (ie basically all SCSI and FC HBAs). > > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > > --- > > block/blk-mq.h | 5 ----- > > include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++------- > > 2 files changed, 26 insertions(+), 12 deletions(-) > > > > diff --git a/block/blk-mq.h b/block/blk-mq.h > > index 1743857e0b01..fbc65eefa017 100644 > > --- a/block/blk-mq.h > > +++ b/block/blk-mq.h > > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, > > return tag < tags->nr_reserved_tags; > > } > > > > -static inline bool blk_mq_is_shared_tags(unsigned int flags) > > -{ > > - return flags & BLK_MQ_F_TAG_HCTX_SHARED; > > -} > > - > > static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) > > { > > if (data->rq_flags & RQF_SCHED_TAGS) > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > > index 01e8c31db665..c921ae5236ab 100644 > > --- a/include/linux/blk-mq.h > > +++ b/include/linux/blk-mq.h > > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, > > > > bool blk_mq_queue_inflight(struct request_queue *q); > > > > +#define queue_for_each_hw_ctx(q, hctx, i) \ > > + xa_for_each(&(q)->hctx_table, (i), (hctx)) > > + > > +#define hctx_for_each_ctx(hctx, ctx, i) \ > > + for ((i) = 0; (i) < (hctx)->nr_ctx && \ > > + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > > + > > +static inline bool blk_mq_is_shared_tags(unsigned int flags) > > +{ > > + return flags & BLK_MQ_F_TAG_HCTX_SHARED; > > +} > > + > > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > > +{ > > + unsigned int nr_active = 0; > > + struct blk_mq_hw_ctx *hctx; > > + unsigned long i; > > + > > + queue_for_each_hw_ctx(q, hctx, i) { > > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > > + return atomic_read(&q->nr_active_requests_shared_tags); > > + nr_active += atomic_read(&hctx->nr_active); > > + } > > + return nr_active; > > +} > > + > > enum { > > /* return when out of requests */ > > BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0), > > @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq) > > return rq + 1; > > } > > > > -#define queue_for_each_hw_ctx(q, hctx, i) \ > > - xa_for_each(&(q)->hctx_table, (i), (hctx)) > > - > > -#define hctx_for_each_ctx(hctx, ctx, i) \ > > - for ((i) = 0; (i) < (hctx)->nr_ctx && \ > > - ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > > - > > static inline void blk_mq_cleanup_rq(struct request *rq) > > { > > if (rq->q->mq_ops->cleanup_rq) > > Well. As discussed, using xarray on 'small' arrays is horrible for > performance. We really should revert the patch from Ming to > turn it back into a simple array; that'll make traversing much faster. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew > Myers, Andrew McDonald, Martje Boudien Moerman > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 13:34 ` Ewan Milne @ 2023-10-03 20:11 ` Uday Shankar 2023-10-04 9:19 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Uday Shankar @ 2023-10-03 20:11 UTC (permalink / raw) To: Ewan Milne Cc: Hannes Reinecke, linux-nvme, tsong, jmeneghi, mlombard, Thomas Song, Randy Jennings, Prabhath Sajeepa Hi Ewan, > For NVMe/TCP, at least on the system I used that has 80 cores, there > were only 1 hctxs This sounds like a target limitation and/or an effect of the command used to create the connection. The host driver will respect any limits imposed by the target or the command creating the connection, but will otherwise prefer to create one I/O queue per CPU (see nvmf_nr_io_queues, nvme_set_queue_count, and nvme_fc_create_io_queues). This is the suggested behavior from the spec as well (section 2.1): > For example, on a four core processor based system, there may be a > queue pair per core to avoid locking and ensure data structures are > created in the appropriate processor core’s cache. Targets aiming to provide high performance should thus allow for the creation of more than one I/O queue. This is the case with the in-kernel target (the default queue count limit is NVMET_NR_QUEUES, or 128) as well as Pure's Flasharray. Testing against such a target, the approach of looping through all hctxs and summing the nr_active on every I/O comes with a very significant latency hit. I used fio to run benchmarks (jobfile attached) and the in-kernel NVMe-TCP target, whose generous limit of 128 allows the host to create one queue per CPU, for 64 queues total. I used a kernel built with two queue-depth path selectors (patch against Linux 6.5 attached) - one is the one you posted, and the other is the atomics-based one that Thomas wrote (with a small change so that path selectors not using the atomic do not have to pay its cost). Here are the submission latency statistics with the looping-over-hctx approach (the full fio output is also attached): slat (usec): min=70, max=598, avg=78.35, stdev= 7.76 And with atomics: slat (usec): min=24, max=491, avg=32.51, stdev= 6.47 Note the ~46us difference in the average submission latency. I used ftrace to measure the amount of time spent in the path selector functions, and the results confirm that essentially all of this additional latency is due to the path selector. For the hctx looping approach: # grep 'nvme_mpath_start_request' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length 0.20338877564886418 # grep 'nvme_mpath_end_request' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length 0.5282572360375025 # grep 'nvme_find_path' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length 46.14676273950285 vs atomics: # grep 'nvme_mpath_start_request' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length 0.20764315633022504 # grep 'nvme_mpath_end_request' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length 0.5173406296446426 # grep 'nvme_find_path' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length 0.4562795568166467 I don't know if this is due to the "using xarray on 'small' arrays is horrible for performance" that Hannes mentioned. Maybe reverting that patch would help things, but I still prefer the atomics approach for its simplicity and the fact that the data does not indicate that the two new RMW ops per I/O are a source of issues. If the contention is still considered a problem, we can "split" the atomic into pieces along: - namespace boundaries, so that the atomic lives in struct nvme_ns instead of struct nvme_ctrl. if different CPUs are doing I/O to different namespaces (which may be a common access pattern), this will reduce the contention on the atomic. this would give us pretty much a 1:1 translation of the queue-length path selector from dm. - hctx boundaries - when we calculate the queue depth, instead of summing the depths for every path's hctxs, only consider the hctx associated to the local CPU. - CPU boundaries - make the queue depth variable fully per-CPU. this should be equivalent to the previous approach in the case where the host can create one hctx per CPU. Of course, all of these come with the disadvantage of sharing less information about path quality. But that's what it boils down to - sharing of rapidly-changing information between CPUs is expensive. We can also consider "standard" optimizations around the atomic(s), like ensuring each one gets its own cache line. --- fio jobfile: [global] ioengine=libaio randrepeat=1 direct=1 group_reporting blocksize=512k iodepth=1 numjobs=1 runtime=15 time_based rw=randread [JOB1] cpus_allowed=1 filename=/dev/nvme4n1 full fio results with hctx-looping approach: JOB1: (g=0): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1 fio-3.19 Starting 1 process JOB1: (groupid=0, jobs=1): err= 0: pid=185197: Mon Oct 2 15:22:39 2023 read: IOPS=490, BW=245MiB/s (257MB/s)(3680MiB/15001msec) slat (usec): min=70, max=598, avg=78.35, stdev= 7.76 clat (usec): min=1662, max=2743, avg=1958.35, stdev=135.18 lat (usec): min=1739, max=2901, avg=2036.92, stdev=135.45 clat percentiles (usec): | 1.00th=[ 1729], 5.00th=[ 1778], 10.00th=[ 1811], 20.00th=[ 1844], | 30.00th=[ 1876], 40.00th=[ 1909], 50.00th=[ 1942], 60.00th=[ 1975], | 70.00th=[ 2008], 80.00th=[ 2057], 90.00th=[ 2147], 95.00th=[ 2212], | 99.00th=[ 2376], 99.50th=[ 2442], 99.90th=[ 2540], 99.95th=[ 2671], | 99.99th=[ 2737] bw ( KiB/s): min=248832, max=254976, per=100.00%, avg=251490.55, stdev=1468.84, samples=29 iops : min= 486, max= 498, avg=491.17, stdev= 2.90, samples=29 lat (msec) : 2=68.68%, 4=31.32% cpu : usr=0.19%, sys=4.08%, ctx=7359, majf=0, minf=138 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=7359,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=245MiB/s (257MB/s), 245MiB/s-245MiB/s (257MB/s-257MB/s), io=3680MiB (3858MB), run=15001-15001msec Disk stats (read/write): nvme4n1: ios=7274/0, merge=0/0, ticks=14134/0, in_queue=14134, util=99.44% full fio results with atomics approach: JOB1: (g=0): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1 fio-3.19 Starting 1 process JOB1: (groupid=0, jobs=1): err= 0: pid=185287: Mon Oct 2 15:23:03 2023 read: IOPS=493, BW=247MiB/s (259MB/s)(3701MiB/15002msec) slat (usec): min=24, max=491, avg=32.51, stdev= 6.47 clat (usec): min=1648, max=24772, avg=1992.77, stdev=329.21 lat (usec): min=1679, max=24804, avg=2025.49, stdev=329.31 clat percentiles (usec): | 1.00th=[ 1745], 5.00th=[ 1795], 10.00th=[ 1827], 20.00th=[ 1876], | 30.00th=[ 1909], 40.00th=[ 1942], 50.00th=[ 1975], 60.00th=[ 2008], | 70.00th=[ 2040], 80.00th=[ 2089], 90.00th=[ 2180], 95.00th=[ 2245], | 99.00th=[ 2409], 99.50th=[ 2474], 99.90th=[ 2606], 99.95th=[ 2868], | 99.99th=[24773] bw ( KiB/s): min=235520, max=260096, per=100.00%, avg=252927.10, stdev=4438.49, samples=29 iops : min= 460, max= 508, avg=493.97, stdev= 8.68, samples=29 lat (msec) : 2=60.26%, 4=39.70%, 10=0.01%, 20=0.01%, 50=0.01% cpu : usr=0.15%, sys=1.91%, ctx=7401, majf=0, minf=139 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=7401,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=247MiB/s (259MB/s), 247MiB/s-247MiB/s (259MB/s-259MB/s), io=3701MiB (3880MB), run=15002-15002msec Disk stats (read/write): nvme4n1: ios=7316/0, merge=0/0, ticks=14470/0, in_queue=14470, util=99.24% patch against Linux 6.5 used for testing. this needs polish and is buggy when the iopolicy changes while I/O is outstanding, but I avoided hitting the bug during experiments by not doing I/O while changing the iopolicy: diff --git a/block/blk-mq.h b/block/blk-mq.h index 1743857e0b01..fbc65eefa017 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, return tag < tags->nr_reserved_tags; } -static inline bool blk_mq_is_shared_tags(unsigned int flags) -{ - return flags & BLK_MQ_F_TAG_HCTX_SHARED; -} - static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) { if (data->rq_flags & RQF_SCHED_TAGS) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 0a88d7bdc5e3..f71aa1e6f537 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -17,6 +17,8 @@ MODULE_PARM_DESC(multipath, static const char *nvme_iopolicy_names[] = { [NVME_IOPOLICY_NUMA] = "numa", [NVME_IOPOLICY_RR] = "round-robin", + [NVME_IOPOLICY_QD_ATOMIC] = "queue-depth-atomic", + [NVME_IOPOLICY_QD_EWAN] = "queue-depth-ewan", }; static int iopolicy = NVME_IOPOLICY_NUMA; @@ -29,6 +31,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp) iopolicy = NVME_IOPOLICY_NUMA; else if (!strncmp(val, "round-robin", 11)) iopolicy = NVME_IOPOLICY_RR; + else if (!strncmp(val, "queue-depth-atomic", 18)) + iopolicy = NVME_IOPOLICY_QD_ATOMIC; + else if (!strncmp(val, "queue-depth-ewan", 16)) + iopolicy = NVME_IOPOLICY_QD_EWAN; else return -EINVAL; @@ -126,6 +132,11 @@ void nvme_mpath_start_request(struct request *rq) { struct nvme_ns *ns = rq->q->queuedata; struct gendisk *disk = ns->head->disk; + struct nvme_subsystem *subsys = ns->ctrl->subsys; + + if (READ_ONCE(subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) { + atomic_inc(&ns->ctrl->nr_active); + } if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq)) return; @@ -139,9 +150,15 @@ EXPORT_SYMBOL_GPL(nvme_mpath_start_request); void nvme_mpath_end_request(struct request *rq) { struct nvme_ns *ns = rq->q->queuedata; + struct nvme_subsystem *subsys = ns->ctrl->subsys; + + if (READ_ONCE(subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) { + atomic_dec(&ns->ctrl->nr_active); + } if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS)) return; + bdev_end_io_acct(ns->head->disk->part0, req_op(rq), blk_rq_bytes(rq) >> SECTOR_SHIFT, nvme_req(rq)->start_time); @@ -329,13 +346,69 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head, return found; } +static struct nvme_ns *nvme_queue_depth_atomic_path(struct nvme_ns_head *head) +{ + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; + unsigned int depth; + + list_for_each_entry_rcu(ns, &head->list, siblings) { + if (nvme_path_is_disabled(ns)) + continue; + + depth = atomic_read(&ns->ctrl->nr_active); + + if (ns->ana_state == NVME_ANA_OPTIMIZED && + depth < min_depth_opt) { + min_depth_opt = depth; + best_opt = ns; + } + + if (ns->ana_state == NVME_ANA_NONOPTIMIZED && + depth < min_depth_nonopt) { + min_depth_nonopt = depth; + best_nonopt = ns; + } + } + + return best_opt ? best_opt : best_nonopt; +} + +static struct nvme_ns *nvme_queue_depth_ewan_path(struct nvme_ns_head *head) +{ + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; + unsigned int depth; + + list_for_each_entry_rcu(ns, &head->list, siblings) { + if (nvme_path_is_disabled(ns)) + continue; + + depth = blk_mq_queue_nr_active(ns->queue); + + if (ns->ana_state == NVME_ANA_OPTIMIZED && + depth < min_depth_opt) { + min_depth_opt = depth; + best_opt = ns; + } + + if (ns->ana_state == NVME_ANA_NONOPTIMIZED && + depth < min_depth_nonopt) { + min_depth_nonopt = depth; + best_nonopt = ns; + } + } + + return best_opt ? best_opt : best_nonopt; +} + static inline bool nvme_path_is_optimized(struct nvme_ns *ns) { return ns->ctrl->state == NVME_CTRL_LIVE && ns->ana_state == NVME_ANA_OPTIMIZED; } -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) +noinline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) { int node = numa_node_id(); struct nvme_ns *ns; @@ -346,6 +419,11 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR) return nvme_round_robin_path(head, node, ns); + else if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) + return nvme_queue_depth_atomic_path(head); + else if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_QD_EWAN) + return nvme_queue_depth_ewan_path(head); + if (unlikely(!nvme_path_is_optimized(ns))) return __nvme_find_path(head, node); return ns; @@ -900,6 +978,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head) void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl) { + atomic_set(&ctrl->nr_active, 0); mutex_init(&ctrl->ana_lock); timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0); INIT_WORK(&ctrl->ana_work, nvme_ana_work); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index f35647c470af..a2e5d09d697d 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -342,6 +342,7 @@ struct nvme_ctrl { u8 anatt; u32 anagrpmax; u32 nanagrpid; + atomic_t nr_active; struct mutex ana_lock; struct nvme_ana_rsp_hdr *ana_log_buf; size_t ana_log_size; @@ -389,6 +390,8 @@ struct nvme_ctrl { enum nvme_iopolicy { NVME_IOPOLICY_NUMA, NVME_IOPOLICY_RR, + NVME_IOPOLICY_QD_ATOMIC, + NVME_IOPOLICY_QD_EWAN, }; struct nvme_subsystem { diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 495ca198775f..a7a32a1fd16c 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -718,6 +718,33 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, bool blk_mq_queue_inflight(struct request_queue *q); +#define queue_for_each_hw_ctx(q, hctx, i) \ + xa_for_each(&(q)->hctx_table, (i), (hctx)) + +#define hctx_for_each_ctx(hctx, ctx, i) \ + for ((i) = 0; (i) < (hctx)->nr_ctx && \ + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) + +static inline bool blk_mq_is_shared_tags(unsigned int flags) +{ + return flags & BLK_MQ_F_TAG_HCTX_SHARED; +} + +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) +{ + unsigned int nr_active = 0; + struct blk_mq_hw_ctx *hctx; + unsigned long i; + + queue_for_each_hw_ctx(q, hctx, i) { + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) + return atomic_read(&q->nr_active_requests_shared_tags); + nr_active += atomic_read(&hctx->nr_active); + } + return nr_active; +} + + enum { /* return when out of requests */ BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0), @@ -943,13 +970,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq) return rq + 1; } -#define queue_for_each_hw_ctx(q, hctx, i) \ - xa_for_each(&(q)->hctx_table, (i), (hctx)) - -#define hctx_for_each_ctx(hctx, ctx, i) \ - for ((i) = 0; (i) < (hctx)->nr_ctx && \ - ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) - static inline void blk_mq_cleanup_rq(struct request *rq) { if (rq->q->mq_ops->cleanup_rq) ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-10-03 20:11 ` Uday Shankar @ 2023-10-04 9:19 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2023-10-04 9:19 UTC (permalink / raw) To: Uday Shankar, Ewan Milne Cc: Hannes Reinecke, linux-nvme, tsong, jmeneghi, mlombard, Thomas Song, Randy Jennings, Prabhath Sajeepa > I don't know if this is due to the "using xarray on 'small' arrays is > horrible for performance" that Hannes mentioned. Maybe reverting that > patch would help things, but I still prefer the atomics approach for its > simplicity and the fact that the data does not indicate that the two new > RMW ops per I/O are a source of issues. If the contention is still > considered a problem, we can "split" the atomic into pieces along: > > - namespace boundaries, so that the atomic lives in struct nvme_ns > instead of struct nvme_ctrl. if different CPUs are doing I/O to > different namespaces (which may be a common access pattern), this will > reduce the contention on the atomic. this would give us pretty much a > 1:1 translation of the queue-length path selector from dm. > - hctx boundaries - when we calculate the queue depth, instead of > summing the depths for every path's hctxs, only consider the hctx > associated to the local CPU. I think that this is the best approach. It is matching the existing synchronization boundary in the io-path today. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne 2023-09-25 20:56 ` Bart Van Assche 2023-09-27 7:36 ` Hannes Reinecke @ 2023-09-27 9:49 ` Ming Lei 2023-09-27 13:54 ` Ewan Milne 2023-09-27 10:56 ` Sagi Grimberg 3 siblings, 1 reply; 24+ messages in thread From: Ming Lei @ 2023-09-27 9:49 UTC (permalink / raw) To: Ewan D. Milne; +Cc: linux-nvme, tsong, jmeneghi, mlombard On Mon, Sep 25, 2023 at 12:31:21PM -0400, Ewan D. Milne wrote: > Returns a count of the total number of active requests > in a queue. For non-shared tags (the usual case) this is > the sum of nr_active from all of the hctxs. hctx->nr_active is only updated if BLK_MQ_F_TAG_QUEUE_SHARED is true and blk_mq_is_shared_tags() is false, so hctx->nr_active isn't available if BLK_MQ_F_TAG_QUEUE_SHARED is false(the usual case?). Thanks, Ming ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 9:49 ` Ming Lei @ 2023-09-27 13:54 ` Ewan Milne 0 siblings, 0 replies; 24+ messages in thread From: Ewan Milne @ 2023-09-27 13:54 UTC (permalink / raw) To: Ming Lei; +Cc: linux-nvme, tsong, jmeneghi, mlombard I instrumented the code to look at nr_active and it did seem to be updated for all the nvme fabrics transports I tested, but I'll check. (As a blk_ .. function it would have to work on anything). Of course the patches depend upon the block layer having the nr_active count available somewhere. I did it this way to avoid having to duplicate the count in the nvme code. -Ewan On Wed, Sep 27, 2023 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote: > > On Mon, Sep 25, 2023 at 12:31:21PM -0400, Ewan D. Milne wrote: > > Returns a count of the total number of active requests > > in a queue. For non-shared tags (the usual case) this is > > the sum of nr_active from all of the hctxs. > > hctx->nr_active is only updated if BLK_MQ_F_TAG_QUEUE_SHARED is true > and blk_mq_is_shared_tags() is false, so hctx->nr_active isn't > available if BLK_MQ_F_TAG_QUEUE_SHARED is false(the usual case?). > > > Thanks, > Ming > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne ` (2 preceding siblings ...) 2023-09-27 9:49 ` Ming Lei @ 2023-09-27 10:56 ` Sagi Grimberg 2023-09-27 13:50 ` Ewan Milne 3 siblings, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2023-09-27 10:56 UTC (permalink / raw) To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard On 9/25/23 19:31, Ewan D. Milne wrote: > Returns a count of the total number of active requests > in a queue. For non-shared tags (the usual case) this is > the sum of nr_active from all of the hctxs. > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > --- > block/blk-mq.h | 5 ----- > include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++------- > 2 files changed, 26 insertions(+), 12 deletions(-) > > diff --git a/block/blk-mq.h b/block/blk-mq.h > index 1743857e0b01..fbc65eefa017 100644 > --- a/block/blk-mq.h > +++ b/block/blk-mq.h > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, > return tag < tags->nr_reserved_tags; > } > > -static inline bool blk_mq_is_shared_tags(unsigned int flags) > -{ > - return flags & BLK_MQ_F_TAG_HCTX_SHARED; > -} > - > static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) > { > if (data->rq_flags & RQF_SCHED_TAGS) > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index 01e8c31db665..c921ae5236ab 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, > > bool blk_mq_queue_inflight(struct request_queue *q); > > +#define queue_for_each_hw_ctx(q, hctx, i) \ > + xa_for_each(&(q)->hctx_table, (i), (hctx)) > + > +#define hctx_for_each_ctx(hctx, ctx, i) \ > + for ((i) = 0; (i) < (hctx)->nr_ctx && \ > + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > + > +static inline bool blk_mq_is_shared_tags(unsigned int flags) > +{ > + return flags & BLK_MQ_F_TAG_HCTX_SHARED; > +} > + > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > +{ > + unsigned int nr_active = 0; > + struct blk_mq_hw_ctx *hctx; > + unsigned long i; > + > + queue_for_each_hw_ctx(q, hctx, i) { > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > + return atomic_read(&q->nr_active_requests_shared_tags); > + nr_active += atomic_read(&hctx->nr_active); > + } I think that for the purposes of nvme-mpath you should probably be interested in the hctx mapped to the running cpu, and not the cumulative active requests. As for BLK_MQ_F_TAG_HCTX_SHARED, this is dead-code until scsi/null makes any use of it... but seems fine in theory. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 10:56 ` Sagi Grimberg @ 2023-09-27 13:50 ` Ewan Milne 2023-09-28 10:56 ` Sagi Grimberg 0 siblings, 1 reply; 24+ messages in thread From: Ewan Milne @ 2023-09-27 13:50 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, tsong, jmeneghi, mlombard I think it is unfortunately necessary to compute the sum from all the hctxs because in the general case there could be threads on other CPUs submitting I/O through another hctx that might significantly affect the result, but I can do some more tests and see how it looks. -Ewan On Wed, Sep 27, 2023 at 6:56 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > On 9/25/23 19:31, Ewan D. Milne wrote: > > Returns a count of the total number of active requests > > in a queue. For non-shared tags (the usual case) this is > > the sum of nr_active from all of the hctxs. > > > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > > --- > > block/blk-mq.h | 5 ----- > > include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++------- > > 2 files changed, 26 insertions(+), 12 deletions(-) > > > > diff --git a/block/blk-mq.h b/block/blk-mq.h > > index 1743857e0b01..fbc65eefa017 100644 > > --- a/block/blk-mq.h > > +++ b/block/blk-mq.h > > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, > > return tag < tags->nr_reserved_tags; > > } > > > > -static inline bool blk_mq_is_shared_tags(unsigned int flags) > > -{ > > - return flags & BLK_MQ_F_TAG_HCTX_SHARED; > > -} > > - > > static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) > > { > > if (data->rq_flags & RQF_SCHED_TAGS) > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > > index 01e8c31db665..c921ae5236ab 100644 > > --- a/include/linux/blk-mq.h > > +++ b/include/linux/blk-mq.h > > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, > > > > bool blk_mq_queue_inflight(struct request_queue *q); > > > > +#define queue_for_each_hw_ctx(q, hctx, i) \ > > + xa_for_each(&(q)->hctx_table, (i), (hctx)) > > + > > +#define hctx_for_each_ctx(hctx, ctx, i) \ > > + for ((i) = 0; (i) < (hctx)->nr_ctx && \ > > + ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++) > > + > > +static inline bool blk_mq_is_shared_tags(unsigned int flags) > > +{ > > + return flags & BLK_MQ_F_TAG_HCTX_SHARED; > > +} > > + > > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q) > > +{ > > + unsigned int nr_active = 0; > > + struct blk_mq_hw_ctx *hctx; > > + unsigned long i; > > + > > + queue_for_each_hw_ctx(q, hctx, i) { > > + if (unlikely(blk_mq_is_shared_tags(hctx->flags))) > > + return atomic_read(&q->nr_active_requests_shared_tags); > > + nr_active += atomic_read(&hctx->nr_active); > > + } > > I think that for the purposes of nvme-mpath you should probably be > interested in the hctx mapped to the running cpu, and not the > cumulative active requests. > > As for BLK_MQ_F_TAG_HCTX_SHARED, this is dead-code until scsi/null makes > any use of it... but seems fine in theory. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] block: introduce blk_queue_nr_active() 2023-09-27 13:50 ` Ewan Milne @ 2023-09-28 10:56 ` Sagi Grimberg 0 siblings, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2023-09-28 10:56 UTC (permalink / raw) To: Ewan Milne; +Cc: linux-nvme, tsong, jmeneghi, mlombard > I think it is unfortunately necessary to compute the sum from all the hctxs > because in the general case there could be threads on other CPUs submitting > I/O through another hctx that might significantly affect the result, This argument is true in general, other threads issuing I/O on different contexts may affect each other as you never know how a controller processes the submission queues. Its a tradeoff between the level of inspection and a reasonable overhead. I do not think iterating over all hctxs of all available paths is reasonable. But lets wait for your testing. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" 2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne @ 2023-09-25 16:31 ` Ewan D. Milne 2023-09-27 7:38 ` Hannes Reinecke 2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne 2 siblings, 1 reply; 24+ messages in thread From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw) To: linux-nvme; +Cc: tsong, jmeneghi, mlombard The existing iopolicies are inefficient in some cases, such as the presence of a path with high latency. The round-robin policy would use that path equally with faster paths, which results in sub-optimal performance. The queue-depth policy instead sends I/O requests down the path with the least amount of requests in its request queue. Paths with lower latency will clear requests more quickly and have less requests in their queues compared to "bad" paths. The aim is to use those paths the most to bring down overall latency. [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace and compilation warnings, updated MODULE_PARM description and changed to use new blk_mq_queue_nr_active() instead of adding additional atomic counters.] Co-developed-by: Thomas Song <tsong@purestorage.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> --- drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++-- drivers/nvme/host/nvme.h | 1 + 2 files changed, 47 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 0a88d7bdc5e3..deea7fd4aa95 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath, static const char *nvme_iopolicy_names[] = { [NVME_IOPOLICY_NUMA] = "numa", [NVME_IOPOLICY_RR] = "round-robin", + [NVME_IOPOLICY_QD] = "queue-depth", }; static int iopolicy = NVME_IOPOLICY_NUMA; @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp) iopolicy = NVME_IOPOLICY_NUMA; else if (!strncmp(val, "round-robin", 11)) iopolicy = NVME_IOPOLICY_RR; + else if (!strncmp(val, "queue-depth", 11)) + iopolicy = NVME_IOPOLICY_QD; else return -EINVAL; @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp) module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy, &iopolicy, 0644); MODULE_PARM_DESC(iopolicy, - "Default multipath I/O policy; 'numa' (default) or 'round-robin'"); + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'"); void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) { @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head, return found; } +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head, + int node) +{ + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; + unsigned int depth; + + list_for_each_entry_rcu(ns, &head->list, siblings) { + if (nvme_path_is_disabled(ns)) + continue; + + depth = blk_mq_queue_nr_active(ns->queue); + + switch (ns->ana_state) { + case NVME_ANA_OPTIMIZED: + if (depth < min_depth_opt) { + min_depth_opt = depth; + best_opt = ns; + } + break; + + case NVME_ANA_NONOPTIMIZED: + if (depth < min_depth_nonopt) { + min_depth_nonopt = depth; + best_nonopt = ns; + } + break; + default: + break; + } + } + + return best_opt ? best_opt : best_nonopt; +} + static inline bool nvme_path_is_optimized(struct nvme_ns *ns) { return ns->ctrl->state == NVME_CTRL_LIVE && @@ -344,8 +382,14 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) if (unlikely(!ns)) return __nvme_find_path(head, node); - if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR) + switch(READ_ONCE(head->subsys->iopolicy)) { + case NVME_IOPOLICY_RR: return nvme_round_robin_path(head, node, ns); + + case NVME_IOPOLICY_QD: + return nvme_queue_depth_path(head, node); + } + if (unlikely(!nvme_path_is_optimized(ns))) return __nvme_find_path(head, node); return ns; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index f35647c470af..938fbed007ce 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -389,6 +389,7 @@ struct nvme_ctrl { enum nvme_iopolicy { NVME_IOPOLICY_NUMA, NVME_IOPOLICY_RR, + NVME_IOPOLICY_QD, }; struct nvme_subsystem { -- 2.20.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" 2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne @ 2023-09-27 7:38 ` Hannes Reinecke 2023-09-27 11:02 ` Sagi Grimberg 2023-09-27 13:42 ` Ewan Milne 0 siblings, 2 replies; 24+ messages in thread From: Hannes Reinecke @ 2023-09-27 7:38 UTC (permalink / raw) To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard On 9/25/23 18:31, Ewan D. Milne wrote: > The existing iopolicies are inefficient in some cases, such as > the presence of a path with high latency. The round-robin > policy would use that path equally with faster paths, which > results in sub-optimal performance. > > The queue-depth policy instead sends I/O requests down the path > with the least amount of requests in its request queue. Paths > with lower latency will clear requests more quickly and have less > requests in their queues compared to "bad" paths. The aim is to > use those paths the most to bring down overall latency. > > [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace > and compilation warnings, updated MODULE_PARM description > and changed to use new blk_mq_queue_nr_active() instead of adding > additional atomic counters.] > > Co-developed-by: Thomas Song <tsong@purestorage.com> > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > --- > drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++-- > drivers/nvme/host/nvme.h | 1 + > 2 files changed, 47 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 0a88d7bdc5e3..deea7fd4aa95 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath, > static const char *nvme_iopolicy_names[] = { > [NVME_IOPOLICY_NUMA] = "numa", > [NVME_IOPOLICY_RR] = "round-robin", > + [NVME_IOPOLICY_QD] = "queue-depth", > }; > > static int iopolicy = NVME_IOPOLICY_NUMA; > @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp) > iopolicy = NVME_IOPOLICY_NUMA; > else if (!strncmp(val, "round-robin", 11)) > iopolicy = NVME_IOPOLICY_RR; > + else if (!strncmp(val, "queue-depth", 11)) > + iopolicy = NVME_IOPOLICY_QD; > else > return -EINVAL; > > @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp) > module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy, > &iopolicy, 0644); > MODULE_PARM_DESC(iopolicy, > - "Default multipath I/O policy; 'numa' (default) or 'round-robin'"); > + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'"); > > void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) > { > @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head, > return found; > } > > +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head, > + int node) > +{ > + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; > + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; > + unsigned int depth; > + > + list_for_each_entry_rcu(ns, &head->list, siblings) { > + if (nvme_path_is_disabled(ns)) > + continue; > + > + depth = blk_mq_queue_nr_active(ns->queue); That would require an 'atomic_read' on each path for each I/O. Which absolutely sucks for performance. Considering that we really are only interested in a rough idea about path usage, wouldn't it be better to use READ_ONCE()/WRITE_ONCE() for accumulating the nr of requests per queue? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" 2023-09-27 7:38 ` Hannes Reinecke @ 2023-09-27 11:02 ` Sagi Grimberg 2023-09-27 13:42 ` Ewan Milne 1 sibling, 0 replies; 24+ messages in thread From: Sagi Grimberg @ 2023-09-27 11:02 UTC (permalink / raw) To: Hannes Reinecke, Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard >> The existing iopolicies are inefficient in some cases, such as >> the presence of a path with high latency. The round-robin >> policy would use that path equally with faster paths, which >> results in sub-optimal performance. >> >> The queue-depth policy instead sends I/O requests down the path >> with the least amount of requests in its request queue. Paths >> with lower latency will clear requests more quickly and have less >> requests in their queues compared to "bad" paths. The aim is to >> use those paths the most to bring down overall latency. >> >> [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace >> and compilation warnings, updated MODULE_PARM description >> and changed to use new blk_mq_queue_nr_active() instead of adding >> additional atomic counters.] >> >> Co-developed-by: Thomas Song <tsong@purestorage.com> >> Signed-off-by: Ewan D. Milne <emilne@redhat.com> >> --- >> drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++-- >> drivers/nvme/host/nvme.h | 1 + >> 2 files changed, 47 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/nvme/host/multipath.c >> b/drivers/nvme/host/multipath.c >> index 0a88d7bdc5e3..deea7fd4aa95 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath, >> static const char *nvme_iopolicy_names[] = { >> [NVME_IOPOLICY_NUMA] = "numa", >> [NVME_IOPOLICY_RR] = "round-robin", >> + [NVME_IOPOLICY_QD] = "queue-depth", >> }; >> static int iopolicy = NVME_IOPOLICY_NUMA; >> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const >> struct kernel_param *kp) >> iopolicy = NVME_IOPOLICY_NUMA; >> else if (!strncmp(val, "round-robin", 11)) >> iopolicy = NVME_IOPOLICY_RR; >> + else if (!strncmp(val, "queue-depth", 11)) >> + iopolicy = NVME_IOPOLICY_QD; >> else >> return -EINVAL; >> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct >> kernel_param *kp) >> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy, >> &iopolicy, 0644); >> MODULE_PARM_DESC(iopolicy, >> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'"); >> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' >> or 'queue-depth'"); >> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) >> { >> @@ -329,6 +332,41 @@ static struct nvme_ns >> *nvme_round_robin_path(struct nvme_ns_head *head, >> return found; >> } >> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head, >> + int node) >> +{ >> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; >> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; >> + unsigned int depth; >> + >> + list_for_each_entry_rcu(ns, &head->list, siblings) { >> + if (nvme_path_is_disabled(ns)) >> + continue; >> + >> + depth = blk_mq_queue_nr_active(ns->queue); > > That would require an 'atomic_read' on each path for each I/O. > Which absolutely sucks for performance. Actually in the current implementation its for every hctx in every path... > Considering that we really are only interested in a rough idea about > path usage, I am not sure that a "rough idea" is sufficient here. > wouldn't it be better to use READ_ONCE()/WRITE_ONCE() > for accumulating the nr of requests per queue? I'm assuming that WRITE_ONCE() is a slip up because the path selector should absolutely not update the hctx active count. I think its unnatural to access atomic variables non-atomically, and at the very least requires a code comment to why this is ok. But regardless we need some testing to tell us if this is OK to do. This only matters when there are less hctx than running cpus, but that is a probable case for fabrics. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" 2023-09-27 7:38 ` Hannes Reinecke 2023-09-27 11:02 ` Sagi Grimberg @ 2023-09-27 13:42 ` Ewan Milne 1 sibling, 0 replies; 24+ messages in thread From: Ewan Milne @ 2023-09-27 13:42 UTC (permalink / raw) To: Hannes Reinecke; +Cc: linux-nvme, tsong, jmeneghi, mlombard Obviously having to look at the number of active requests on all paths for each I/O has a performance cost, but the tradeoff is that you pay a CPU cost in order to get better path distribution/utilization and hopefully lower I/O latency. And, you only pay this cost if queue-depth is used. Otherwise the patches barely touch the code path. I am not an expert in modern CPU cache architecture, but the updates to the atomic counters on the request/completion are all per-hctx, the path selector only does a read, which I think is a simple (no LOCK prefix) read on x86. The cost would be in the CPU touching the cacheline. I tried some alternatives that did not always get the most current value of nr_active for *each* I/O but didn't get suitable behavior. I/O requests can be submitted very quickly, a lazy computation may not adapt fast enough. -Ewan On Wed, Sep 27, 2023 at 3:38 AM Hannes Reinecke <hare@suse.de> wrote: > > On 9/25/23 18:31, Ewan D. Milne wrote: > > The existing iopolicies are inefficient in some cases, such as > > the presence of a path with high latency. The round-robin > > policy would use that path equally with faster paths, which > > results in sub-optimal performance. > > > > The queue-depth policy instead sends I/O requests down the path > > with the least amount of requests in its request queue. Paths > > with lower latency will clear requests more quickly and have less > > requests in their queues compared to "bad" paths. The aim is to > > use those paths the most to bring down overall latency. > > > > [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace > > and compilation warnings, updated MODULE_PARM description > > and changed to use new blk_mq_queue_nr_active() instead of adding > > additional atomic counters.] > > > > Co-developed-by: Thomas Song <tsong@purestorage.com> > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > > --- > > drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++-- > > drivers/nvme/host/nvme.h | 1 + > > 2 files changed, 47 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > > index 0a88d7bdc5e3..deea7fd4aa95 100644 > > --- a/drivers/nvme/host/multipath.c > > +++ b/drivers/nvme/host/multipath.c > > @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath, > > static const char *nvme_iopolicy_names[] = { > > [NVME_IOPOLICY_NUMA] = "numa", > > [NVME_IOPOLICY_RR] = "round-robin", > > + [NVME_IOPOLICY_QD] = "queue-depth", > > }; > > > > static int iopolicy = NVME_IOPOLICY_NUMA; > > @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp) > > iopolicy = NVME_IOPOLICY_NUMA; > > else if (!strncmp(val, "round-robin", 11)) > > iopolicy = NVME_IOPOLICY_RR; > > + else if (!strncmp(val, "queue-depth", 11)) > > + iopolicy = NVME_IOPOLICY_QD; > > else > > return -EINVAL; > > > > @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp) > > module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy, > > &iopolicy, 0644); > > MODULE_PARM_DESC(iopolicy, > > - "Default multipath I/O policy; 'numa' (default) or 'round-robin'"); > > + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'"); > > > > void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) > > { > > @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head, > > return found; > > } > > > > +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head, > > + int node) > > +{ > > + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; > > + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; > > + unsigned int depth; > > + > > + list_for_each_entry_rcu(ns, &head->list, siblings) { > > + if (nvme_path_is_disabled(ns)) > > + continue; > > + > > + depth = blk_mq_queue_nr_active(ns->queue); > > That would require an 'atomic_read' on each path for each I/O. > Which absolutely sucks for performance. > > Considering that we really are only interested in a rough idea about > path usage, wouldn't it be better to use READ_ONCE()/WRITE_ONCE() > for accumulating the nr of requests per queue? > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew > Myers, Andrew McDonald, Martje Boudien Moerman > ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option 2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne 2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne @ 2023-09-25 16:31 ` Ewan D. Milne 2023-09-27 7:41 ` Hannes Reinecke 2 siblings, 1 reply; 24+ messages in thread From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw) To: linux-nvme; +Cc: tsong, jmeneghi, mlombard Setting nvme_core.use_nonoptimized=true will cause the path selector to treat optimized and nonoptimized paths equally. This is because although an NVMe fabrics target device may report an unoptimized ANA state, it is possible that other factors such as fabric latency are a large factor in the I/O service time. And, throughput may improve overall if nonoptimized ports are also used. Signed-off-by: Ewan D. Milne <emilne@redhat.com> --- drivers/nvme/host/multipath.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index deea7fd4aa95..f640b8ad934b 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -13,6 +13,10 @@ bool multipath = true; module_param(multipath, bool, 0444); MODULE_PARM_DESC(multipath, "turn on native support for multiple controllers per subsystem"); +bool use_nonoptimized = false; +module_param(use_nonoptimized, bool, 0644); +MODULE_PARM_DESC(use_nonoptimized, + "Use ANA nonoptimized paths as well as optimized paths for I/O"); static const char *nvme_iopolicy_names[] = { [NVME_IOPOLICY_NUMA] = "numa", @@ -272,7 +276,10 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node) } } - if (!found) + /* + * If fallback_distance was set, fallback must also be set + */ + if (!found || (use_nonoptimized && (fallback_distance < found_distance))) found = fallback; if (found) rcu_assign_pointer(head->current_path[node], found); @@ -310,8 +317,12 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head, found = ns; goto out; } - if (ns->ana_state == NVME_ANA_NONOPTIMIZED) + if (ns->ana_state == NVME_ANA_NONOPTIMIZED) { found = ns; + if (use_nonoptimized) + goto out; + } + } /* @@ -321,7 +332,7 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head, * - no other usable path found and current is usable. */ if (!nvme_path_is_disabled(old) && - (old->ana_state == NVME_ANA_OPTIMIZED || + (old->ana_state == NVME_ANA_OPTIMIZED || use_nonoptimized || (!found && old->ana_state == NVME_ANA_NONOPTIMIZED))) return old; @@ -364,6 +375,11 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head, } } + /* + * If min_depth_nonopt was set, best_nonopt must also be set + */ + if (use_nonoptimized && (min_depth_nonopt < min_depth_opt)) + return best_nonopt; return best_opt ? best_opt : best_nonopt; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option 2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne @ 2023-09-27 7:41 ` Hannes Reinecke 2023-09-27 11:31 ` Sagi Grimberg 2023-09-27 13:45 ` Ewan Milne 0 siblings, 2 replies; 24+ messages in thread From: Hannes Reinecke @ 2023-09-27 7:41 UTC (permalink / raw) To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard On 9/25/23 18:31, Ewan D. Milne wrote: > Setting nvme_core.use_nonoptimized=true will cause the path > selector to treat optimized and nonoptimized paths equally. > > This is because although an NVMe fabrics target device may report > an unoptimized ANA state, it is possible that other factors such > as fabric latency are a large factor in the I/O service time. And, > throughput may improve overall if nonoptimized ports are also used. > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > --- > drivers/nvme/host/multipath.c | 22 +++++++++++++++++++--- > 1 file changed, 19 insertions(+), 3 deletions(-) > No. Please don't. There's a reason why controllers specify paths as 'active/optimized' or 'active/non-optimized'. If they had wanted us to use all paths they would have put them into the same group. They tend to get very unhappy if you start using them at the same time. (Triggering failover etc.) Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option 2023-09-27 7:41 ` Hannes Reinecke @ 2023-09-27 11:31 ` Sagi Grimberg 2023-09-27 13:11 ` John Meneghini 2023-09-27 13:45 ` Ewan Milne 1 sibling, 1 reply; 24+ messages in thread From: Sagi Grimberg @ 2023-09-27 11:31 UTC (permalink / raw) To: Hannes Reinecke, Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard >> Setting nvme_core.use_nonoptimized=true will cause the path >> selector to treat optimized and nonoptimized paths equally. >> >> This is because although an NVMe fabrics target device may report >> an unoptimized ANA state, it is possible that other factors such >> as fabric latency are a large factor in the I/O service time. And, >> throughput may improve overall if nonoptimized ports are also used. >> >> Signed-off-by: Ewan D. Milne <emilne@redhat.com> >> --- >> drivers/nvme/host/multipath.c | 22 +++++++++++++++++++--- >> 1 file changed, 19 insertions(+), 3 deletions(-) >> > No. Please don't. > > There's a reason why controllers specify paths as 'active/optimized' or > 'active/non-optimized'. If they had wanted us to use all paths they > would have put them into the same group. > They tend to get very unhappy if you start using them at the same time. > (Triggering failover etc.) I have to agree here. This is effectively a modparam that says all paths are optimized regardless of what the controller reports. While I do acknowledge that there may be some merit to use non-optimized paths as well, but its almost impossible to know some latent optimum path distribution. Hence the host forfeits even attempting. If the controller wants all path used, it should make all paths optimized and the host can examine QD accumulating on some paths vs others. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option 2023-09-27 11:31 ` Sagi Grimberg @ 2023-09-27 13:11 ` John Meneghini 0 siblings, 0 replies; 24+ messages in thread From: John Meneghini @ 2023-09-27 13:11 UTC (permalink / raw) To: Sagi Grimberg, Hannes Reinecke, Ewan D. Milne, linux-nvme; +Cc: tsong, mlombard Ewan I discussed this patch and agreed this is not something we want to go upstream. It was only included here for completeness. This patch was only used to increase the number of active paths during the testing of patches 01 and 02. The test bed Ewan used originally had only 4 35Gbps nvme-tcp contollers (2 active optimized and 2 active non-optimized). He used this patch to change the multi-pathing policy and enable the use all 4 controllers - resulting in 4 active paths. Those test results can be seen here: https://people.redhat.com/jmeneghi/.multipath/test1/A400-TCP-FIO-RR.ps - round-robin io tests https://people.redhat.com/jmeneghi/.multipath/test1/A400-TCP-FIO-QD.ps - queue-depth io tests After sending these patches upstream on Monday Ewan and I built a new test bed with 8 controllers - 4 active optimize paths and 4 non-optimized paths. This provided a true multi-path test bed and the use_nonoptimized patch wasn't needed. In addition, this test bed has a mixed controller subsystem consisting of 4 32GB nvme-fc controllers and 4 100Gbps nvme-tcp controllers. This provided the optimal mix of active optimized controllers paths with different transports that have inherently different latency characteristics. [root@rhel-storage-104 ~]# nvme list-subsys nvme-subsys3 - NQN=nqn.1992-08.com.netapp:sn.2b82d9b13bb211ee8744d039ea989119:subsystem.SS104a \ +- nvme10 fc traddr=nn-0x2027d039ea98949e:pn-0x202cd039ea98949e,host_traddr=nn-0x200000109b9b7f0d:pn-0x100000109b9b7f0d live +- nvme11 fc traddr=nn-0x2027d039ea98949e:pn-0x2029d039ea98949e,host_traddr=nn-0x200000109b9b7f0c:pn-0x100000109b9b7f0c live +- nvme12 fc traddr=nn-0x2027d039ea98949e:pn-0x2028d039ea98949e,host_traddr=nn-0x200000109b9b7f0d:pn-0x100000109b9b7f0d live +- nvme13 tcp traddr=172.18.50.13,trsvcid=4420,src_addr=172.18.50.3 live +- nvme2 tcp traddr=172.18.60.16,trsvcid=4420,src_addr=172.18.60.4 live +- nvme3 fc traddr=nn-0x2027d039ea98949e:pn-0x202dd039ea98949e,host_traddr=nn-0x200000109b9b7f0c:pn-0x100000109b9b7f0c live +- nvme4 tcp traddr=172.18.50.15,trsvcid=4420,src_addr=172.18.50.3 live +- nvme9 tcp traddr=172.18.60.14,trsvcid=4420,src_addr=172.18.60.4 live I shared the performance graphs for this testbed using these patches at the ALPSS conference today. Graphs of inflight I/O on 4 Optimized paths (2 FC, 2 TCP) with round-robin: https://people.redhat.com/jmeneghi/.multipath/test2/A400-TEST1-FIO-RR.ps and queue-depth: https://people.redhat.com/jmeneghi/.multipath/test2/A400-TEST1-FIO-QD.ps Also a included is a graph showing the combined number of inflight I/Os on all paths, plotted with both RR and QD: https://people.redhat.com/jmeneghi/.multipath/test2/A400-TEST1-FIO-MAX.ps What we see in these graphs is that RR is only using 1 of the 4 possible paths, and with QD all 4 paths are used about equally while the maximum inflight I/Os count almost doubles. Hope this helps. John A. Meneghini Senior Principal Platform Storage Engineer RHEL SST - Platform Storage Group jmeneghi@redhat.com On 9/27/23 13:31, Sagi Grimberg wrote: > >>> Setting nvme_core.use_nonoptimized=true will cause the path >>> selector to treat optimized and nonoptimized paths equally. >>> >>> This is because although an NVMe fabrics target device may report >>> an unoptimized ANA state, it is possible that other factors such >>> as fabric latency are a large factor in the I/O service time. And, >>> throughput may improve overall if nonoptimized ports are also used. >>> >>> Signed-off-by: Ewan D. Milne <emilne@redhat.com> >>> --- >>> drivers/nvme/host/multipath.c | 22 +++++++++++++++++++--- >>> 1 file changed, 19 insertions(+), 3 deletions(-) >>> >> No. Please don't. >> >> There's a reason why controllers specify paths as 'active/optimized' or 'active/non-optimized'. If they had wanted us to use >> all paths they would have put them into the same group. >> They tend to get very unhappy if you start using them at the same time. >> (Triggering failover etc.) > > I have to agree here. This is effectively a modparam that says > all paths are optimized regardless of what the controller reports. > > While I do acknowledge that there may be some merit to use non-optimized > paths as well, but its almost impossible to know some latent optimum > path distribution. Hence the host forfeits even attempting. > > If the controller wants all path used, it should make all paths > optimized and the host can examine QD accumulating on some paths > vs others. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option 2023-09-27 7:41 ` Hannes Reinecke 2023-09-27 11:31 ` Sagi Grimberg @ 2023-09-27 13:45 ` Ewan Milne 1 sibling, 0 replies; 24+ messages in thread From: Ewan Milne @ 2023-09-27 13:45 UTC (permalink / raw) To: Hannes Reinecke; +Cc: linux-nvme, tsong, jmeneghi, mlombard Yes, you are probably right. I added this patch because what I found was that, for arrays I was using, the nonoptimized paths had almost the same characteristics as the optimized paths for some workloads. They had a small amount of additional latency. And, it is helpful to have paths available with slightly different performance in order to experiment with the behavior of different path selectors. I'm OK with omitting this patch but I wanted people to see how it could be done. -Ewan On Wed, Sep 27, 2023 at 3:41 AM Hannes Reinecke <hare@suse.de> wrote: > > On 9/25/23 18:31, Ewan D. Milne wrote: > > Setting nvme_core.use_nonoptimized=true will cause the path > > selector to treat optimized and nonoptimized paths equally. > > > > This is because although an NVMe fabrics target device may report > > an unoptimized ANA state, it is possible that other factors such > > as fabric latency are a large factor in the I/O service time. And, > > throughput may improve overall if nonoptimized ports are also used. > > > > Signed-off-by: Ewan D. Milne <emilne@redhat.com> > > --- > > drivers/nvme/host/multipath.c | 22 +++++++++++++++++++--- > > 1 file changed, 19 insertions(+), 3 deletions(-) > > > No. Please don't. > > There's a reason why controllers specify paths as 'active/optimized' or > 'active/non-optimized'. If they had wanted us to use all paths they > would have put them into the same group. > They tend to get very unhappy if you start using them at the same time. > (Triggering failover etc.) > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew > Myers, Andrew McDonald, Martje Boudien Moerman > ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2023-10-04 9:19 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne 2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne 2023-09-25 20:56 ` Bart Van Assche 2023-09-27 14:49 ` Ewan Milne 2023-09-27 14:58 ` Bart Van Assche 2023-09-27 7:36 ` Hannes Reinecke 2023-09-27 11:37 ` Ming Lei 2023-09-27 13:34 ` Ewan Milne 2023-10-03 20:11 ` Uday Shankar 2023-10-04 9:19 ` Sagi Grimberg 2023-09-27 9:49 ` Ming Lei 2023-09-27 13:54 ` Ewan Milne 2023-09-27 10:56 ` Sagi Grimberg 2023-09-27 13:50 ` Ewan Milne 2023-09-28 10:56 ` Sagi Grimberg 2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne 2023-09-27 7:38 ` Hannes Reinecke 2023-09-27 11:02 ` Sagi Grimberg 2023-09-27 13:42 ` Ewan Milne 2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne 2023-09-27 7:41 ` Hannes Reinecke 2023-09-27 11:31 ` Sagi Grimberg 2023-09-27 13:11 ` John Meneghini 2023-09-27 13:45 ` Ewan Milne
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox