[PATCH 0/3] NVMe multipath path selector enhancements

public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed

* [PATCH 0/3] NVMe multipath path selector enhancements
@ 2023-09-25 16:31 Ewan D. Milne
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw)
  To: linux-nvme; +Cc: tsong, jmeneghi, mlombard

This patch series adds a "queue-depth" iopolicy to the NVMe multipath
path selector code.  It also adds the capability to optionally allow
the use of ANA nonoptimized paths as well as optimized paths, this can
be beneficial in some cases and helps to provide additional paths to
verify the desired target port utilization.

Signend-off-by: Ewan D. Milne <emilne@redhat.com>

Ewan D. Milne (3):
  block: introduce blk_queue_nr_active()
  Implemented new iopolicy "queue-depth"
  nvme-multipath: add "use_nonoptimized" module option

 block/blk-mq.h                |  5 ---
 drivers/nvme/host/multipath.c | 70 ++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h      |  1 +
 include/linux/blk-mq.h        | 33 +++++++++++++----
 4 files changed, 92 insertions(+), 17 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne
@ 2023-09-25 16:31 ` Ewan D. Milne
  2023-09-25 20:56   ` Bart Van Assche
                     ` (3 more replies)
  2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne
  2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne
  2 siblings, 4 replies; 24+ messages in thread
From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw)
  To: linux-nvme; +Cc: tsong, jmeneghi, mlombard

Returns a count of the total number of active requests
in a queue.  For non-shared tags (the usual case) this is
the sum of nr_active from all of the hctxs.

Signed-off-by: Ewan D. Milne <emilne@redhat.com>
---
 block/blk-mq.h         |  5 -----
 include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++-------
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1743857e0b01..fbc65eefa017 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
 	return tag < tags->nr_reserved_tags;
 }
 
-static inline bool blk_mq_is_shared_tags(unsigned int flags)
-{
-	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
-}
-
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (data->rq_flags & RQF_SCHED_TAGS)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 01e8c31db665..c921ae5236ab 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
 
 bool blk_mq_queue_inflight(struct request_queue *q);
 
+#define queue_for_each_hw_ctx(q, hctx, i)				\
+	xa_for_each(&(q)->hctx_table, (i), (hctx))
+
+#define hctx_for_each_ctx(hctx, ctx, i)					\
+	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
+	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
+
+static inline bool blk_mq_is_shared_tags(unsigned int flags)
+{
+	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
+static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
+{
+	unsigned int nr_active = 0;
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
+			return atomic_read(&q->nr_active_requests_shared_tags);
+		nr_active += atomic_read(&hctx->nr_active);
+	}
+	return nr_active;
+}
+
 enum {
 	/* return when out of requests */
 	BLK_MQ_REQ_NOWAIT	= (__force blk_mq_req_flags_t)(1 << 0),
@@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
 	return rq + 1;
 }
 
-#define queue_for_each_hw_ctx(q, hctx, i)				\
-	xa_for_each(&(q)->hctx_table, (i), (hctx))
-
-#define hctx_for_each_ctx(hctx, ctx, i)					\
-	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
-	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
-
 static inline void blk_mq_cleanup_rq(struct request *rq)
 {
 	if (rq->q->mq_ops->cleanup_rq)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
@ 2023-09-25 20:56   ` Bart Van Assche
  2023-09-27 14:49     ` Ewan Milne
  2023-09-27  7:36   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2023-09-25 20:56 UTC (permalink / raw)
  To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard

On 9/25/23 09:31, Ewan D. Milne wrote:
> +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> +{
> +	unsigned int nr_active = 0;
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned long i;
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> +			return atomic_read(&q->nr_active_requests_shared_tags);
> +		nr_active += atomic_read(&hctx->nr_active);
> +	}
> +	return nr_active;
> +}

The above function should never be called from any command submission
code path. Hence, I think it should be added in a .c file instead of
include/linux/blk-mq.h.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-25 20:56   ` Bart Van Assche
@ 2023-09-27 14:49     ` Ewan Milne
  2023-09-27 14:58       ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Ewan Milne @ 2023-09-27 14:49 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-nvme, tsong, jmeneghi, mlombard

The function is only called in one place by the queue-depth nvme path
selector in a
later patch in the series, so it is called in the submission path,
which is why I made it inline.

Or did you mean that the function is too expensive?  See other discussions.

-Ewan

On Mon, Sep 25, 2023 at 4:56 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/25/23 09:31, Ewan D. Milne wrote:
> > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> > +{
> > +     unsigned int nr_active = 0;
> > +     struct blk_mq_hw_ctx *hctx;
> > +     unsigned long i;
> > +
> > +     queue_for_each_hw_ctx(q, hctx, i) {
> > +             if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> > +                     return atomic_read(&q->nr_active_requests_shared_tags);
> > +             nr_active += atomic_read(&hctx->nr_active);
> > +     }
> > +     return nr_active;
> > +}
>
> The above function should never be called from any command submission
> code path. Hence, I think it should be added in a .c file instead of
> include/linux/blk-mq.h.
>
> Thanks,
>
> Bart.
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27 14:49     ` Ewan Milne
@ 2023-09-27 14:58       ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2023-09-27 14:58 UTC (permalink / raw)
  To: Ewan Milne; +Cc: linux-nvme, tsong, jmeneghi, mlombard

On 9/27/23 07:49, Ewan Milne wrote:
> On Mon, Sep 25, 2023 at 4:56 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 9/25/23 09:31, Ewan D. Milne wrote:
>>> +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
>>> +{
>>> +     unsigned int nr_active = 0;
>>> +     struct blk_mq_hw_ctx *hctx;
>>> +     unsigned long i;
>>> +
>>> +     queue_for_each_hw_ctx(q, hctx, i) {
>>> +             if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
>>> +                     return atomic_read(&q->nr_active_requests_shared_tags);
>>> +             nr_active += atomic_read(&hctx->nr_active);
>>> +     }
>>> +     return nr_active;
>>> +}
>>
>> The above function should never be called from any command submission
>> code path. Hence, I think it should be added in a .c file instead of
>> include/linux/blk-mq.h.
 >
 > The function is only called in one place by the queue-depth nvme path
 > selector in a
 > later patch in the series, so it is called in the submission path,
 > which is why I made it inline.

Hi Ewan,

In order to keep include/linux/blk-mq.h readable and in order to minimize
the kernel compilation time, nontrivial functions that are not called from
the hot path should occur in a .c file instead of in a .h file.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
  2023-09-25 20:56   ` Bart Van Assche
@ 2023-09-27  7:36   ` Hannes Reinecke
  2023-09-27 11:37     ` Ming Lei
  2023-09-27 13:34     ` Ewan Milne
  2023-09-27  9:49   ` Ming Lei
  2023-09-27 10:56   ` Sagi Grimberg
  3 siblings, 2 replies; 24+ messages in thread
From: Hannes Reinecke @ 2023-09-27  7:36 UTC (permalink / raw)
  To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard

On 9/25/23 18:31, Ewan D. Milne wrote:
> Returns a count of the total number of active requests
> in a queue.  For non-shared tags (the usual case) this is
> the sum of nr_active from all of the hctxs.
> 
Bit of an exaggeration here.
Shared tags are in use if the hardware supports only a global tag space
(ie basically all SCSI and FC HBAs).

> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>   block/blk-mq.h         |  5 -----
>   include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++-------
>   2 files changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 1743857e0b01..fbc65eefa017 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
>   	return tag < tags->nr_reserved_tags;
>   }
>   
> -static inline bool blk_mq_is_shared_tags(unsigned int flags)
> -{
> -	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> -}
> -
>   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
>   {
>   	if (data->rq_flags & RQF_SCHED_TAGS)
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 01e8c31db665..c921ae5236ab 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>   
>   bool blk_mq_queue_inflight(struct request_queue *q);
>   
> +#define queue_for_each_hw_ctx(q, hctx, i)				\
> +	xa_for_each(&(q)->hctx_table, (i), (hctx))
> +
> +#define hctx_for_each_ctx(hctx, ctx, i)					\
> +	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
> +	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> +
> +static inline bool blk_mq_is_shared_tags(unsigned int flags)
> +{
> +	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> +}
> +
> +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> +{
> +	unsigned int nr_active = 0;
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned long i;
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> +			return atomic_read(&q->nr_active_requests_shared_tags);
> +		nr_active += atomic_read(&hctx->nr_active);
> +	}
> +	return nr_active;
> +}
> +
>   enum {
>   	/* return when out of requests */
>   	BLK_MQ_REQ_NOWAIT	= (__force blk_mq_req_flags_t)(1 << 0),
> @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
>   	return rq + 1;
>   }
>   
> -#define queue_for_each_hw_ctx(q, hctx, i)				\
> -	xa_for_each(&(q)->hctx_table, (i), (hctx))
> -
> -#define hctx_for_each_ctx(hctx, ctx, i)					\
> -	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
> -	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> -
>   static inline void blk_mq_cleanup_rq(struct request *rq)
>   {
>   	if (rq->q->mq_ops->cleanup_rq)

Well. As discussed, using xarray on 'small' arrays is horrible for 
performance. We really should revert the patch from Ming to
turn it back into a simple array; that'll make traversing much faster.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27  7:36   ` Hannes Reinecke
@ 2023-09-27 11:37     ` Ming Lei
  2023-09-27 13:34     ` Ewan Milne
  1 sibling, 0 replies; 24+ messages in thread
From: Ming Lei @ 2023-09-27 11:37 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ewan D. Milne, linux-nvme, tsong, jmeneghi, mlombard, ming.lei

On Wed, Sep 27, 2023 at 09:36:11AM +0200, Hannes Reinecke wrote:
> On 9/25/23 18:31, Ewan D. Milne wrote:
> > Returns a count of the total number of active requests
> > in a queue.  For non-shared tags (the usual case) this is
> > the sum of nr_active from all of the hctxs.
> > 
> Bit of an exaggeration here.
> Shared tags are in use if the hardware supports only a global tag space
> (ie basically all SCSI and FC HBAs).
> 
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >   block/blk-mq.h         |  5 -----
> >   include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++-------
> >   2 files changed, 26 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 1743857e0b01..fbc65eefa017 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
> >   	return tag < tags->nr_reserved_tags;
> >   }
> > -static inline bool blk_mq_is_shared_tags(unsigned int flags)
> > -{
> > -	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> > -}
> > -
> >   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
> >   {
> >   	if (data->rq_flags & RQF_SCHED_TAGS)
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 01e8c31db665..c921ae5236ab 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >   bool blk_mq_queue_inflight(struct request_queue *q);
> > +#define queue_for_each_hw_ctx(q, hctx, i)				\
> > +	xa_for_each(&(q)->hctx_table, (i), (hctx))
> > +
> > +#define hctx_for_each_ctx(hctx, ctx, i)					\
> > +	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
> > +	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> > +
> > +static inline bool blk_mq_is_shared_tags(unsigned int flags)
> > +{
> > +	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> > +}
> > +
> > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> > +{
> > +	unsigned int nr_active = 0;
> > +	struct blk_mq_hw_ctx *hctx;
> > +	unsigned long i;
> > +
> > +	queue_for_each_hw_ctx(q, hctx, i) {
> > +		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> > +			return atomic_read(&q->nr_active_requests_shared_tags);
> > +		nr_active += atomic_read(&hctx->nr_active);
> > +	}
> > +	return nr_active;
> > +}
> > +
> >   enum {
> >   	/* return when out of requests */
> >   	BLK_MQ_REQ_NOWAIT	= (__force blk_mq_req_flags_t)(1 << 0),
> > @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
> >   	return rq + 1;
> >   }
> > -#define queue_for_each_hw_ctx(q, hctx, i)				\
> > -	xa_for_each(&(q)->hctx_table, (i), (hctx))
> > -
> > -#define hctx_for_each_ctx(hctx, ctx, i)					\
> > -	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
> > -	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> > -
> >   static inline void blk_mq_cleanup_rq(struct request *rq)
> >   {
> >   	if (rq->q->mq_ops->cleanup_rq)
> 
> Well. As discussed, using xarray on 'small' arrays is horrible for
> performance. We really should revert the patch from Ming to
> turn it back into a simple array; that'll make traversing much faster.

But queue_for_each_hw_ctx() isn't supposed for fast path, and it is always
expensive to run cross-queue things.

It also means any way running cross-queue thing in fast path may not be
one good idea.

Thanks,
Ming



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27  7:36   ` Hannes Reinecke
  2023-09-27 11:37     ` Ming Lei
@ 2023-09-27 13:34     ` Ewan Milne
  2023-10-03 20:11       ` Uday Shankar
  1 sibling, 1 reply; 24+ messages in thread
From: Ewan Milne @ 2023-09-27 13:34 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-nvme, tsong, jmeneghi, mlombard

Maybe I missed something but I only found 2 cases that set
BLK_MQ_F_TAG_HCTX_SHARED.

For NVMe/TCP, at least on the system I used that has 80 cores, there
were only 1 hctxs
but I haven't looked into how many e.g. we have for PCI devices.  FC I
think has a smallish number.

-Ewan

On Wed, Sep 27, 2023 at 3:36 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 9/25/23 18:31, Ewan D. Milne wrote:
> > Returns a count of the total number of active requests
> > in a queue.  For non-shared tags (the usual case) this is
> > the sum of nr_active from all of the hctxs.
> >
> Bit of an exaggeration here.
> Shared tags are in use if the hardware supports only a global tag space
> (ie basically all SCSI and FC HBAs).
>
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >   block/blk-mq.h         |  5 -----
> >   include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++-------
> >   2 files changed, 26 insertions(+), 12 deletions(-)
> >
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 1743857e0b01..fbc65eefa017 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
> >       return tag < tags->nr_reserved_tags;
> >   }
> >
> > -static inline bool blk_mq_is_shared_tags(unsigned int flags)
> > -{
> > -     return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> > -}
> > -
> >   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
> >   {
> >       if (data->rq_flags & RQF_SCHED_TAGS)
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 01e8c31db665..c921ae5236ab 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >
> >   bool blk_mq_queue_inflight(struct request_queue *q);
> >
> > +#define queue_for_each_hw_ctx(q, hctx, i)                            \
> > +     xa_for_each(&(q)->hctx_table, (i), (hctx))
> > +
> > +#define hctx_for_each_ctx(hctx, ctx, i)                                      \
> > +     for ((i) = 0; (i) < (hctx)->nr_ctx &&                           \
> > +          ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> > +
> > +static inline bool blk_mq_is_shared_tags(unsigned int flags)
> > +{
> > +     return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> > +}
> > +
> > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> > +{
> > +     unsigned int nr_active = 0;
> > +     struct blk_mq_hw_ctx *hctx;
> > +     unsigned long i;
> > +
> > +     queue_for_each_hw_ctx(q, hctx, i) {
> > +             if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> > +                     return atomic_read(&q->nr_active_requests_shared_tags);
> > +             nr_active += atomic_read(&hctx->nr_active);
> > +     }
> > +     return nr_active;
> > +}
> > +
> >   enum {
> >       /* return when out of requests */
> >       BLK_MQ_REQ_NOWAIT       = (__force blk_mq_req_flags_t)(1 << 0),
> > @@ -941,13 +967,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
> >       return rq + 1;
> >   }
> >
> > -#define queue_for_each_hw_ctx(q, hctx, i)                            \
> > -     xa_for_each(&(q)->hctx_table, (i), (hctx))
> > -
> > -#define hctx_for_each_ctx(hctx, ctx, i)                                      \
> > -     for ((i) = 0; (i) < (hctx)->nr_ctx &&                           \
> > -          ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> > -
> >   static inline void blk_mq_cleanup_rq(struct request *rq)
> >   {
> >       if (rq->q->mq_ops->cleanup_rq)
>
> Well. As discussed, using xarray on 'small' arrays is horrible for
> performance. We really should revert the patch from Ming to
> turn it back into a simple array; that'll make traversing much faster.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27 13:34     ` Ewan Milne
@ 2023-10-03 20:11       ` Uday Shankar
  2023-10-04  9:19         ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Uday Shankar @ 2023-10-03 20:11 UTC (permalink / raw)
  To: Ewan Milne
  Cc: Hannes Reinecke, linux-nvme, tsong, jmeneghi, mlombard,
	Thomas Song, Randy Jennings, Prabhath Sajeepa

Hi Ewan,

> For NVMe/TCP, at least on the system I used that has 80 cores, there
> were only 1 hctxs

This sounds like a target limitation and/or an effect of the command
used to create the connection. The host driver will respect any limits
imposed by the target or the command creating the connection, but will
otherwise prefer to create one I/O queue per CPU (see nvmf_nr_io_queues,
nvme_set_queue_count, and nvme_fc_create_io_queues). This is the
suggested behavior from the spec as well (section 2.1):

> For example, on a four core processor based system, there may be a
> queue pair per core to avoid locking and ensure data structures are
> created in the appropriate processor core’s cache.

Targets aiming to provide high performance should thus allow for the
creation of more than one I/O queue. This is the case with the in-kernel
target (the default queue count limit is NVMET_NR_QUEUES, or 128) as
well as Pure's Flasharray.

Testing against such a target, the approach of looping through all hctxs
and summing the nr_active on every I/O comes with a very significant
latency hit. I used fio to run benchmarks (jobfile attached) and the
in-kernel NVMe-TCP target, whose generous limit of 128 allows the host
to create one queue per CPU, for 64 queues total. I used a kernel built
with two queue-depth path selectors (patch against Linux 6.5 attached) -
one is the one you posted, and the other is the atomics-based one that
Thomas wrote (with a small change so that path selectors not using the
atomic do not have to pay its cost). Here are the submission latency
statistics with the looping-over-hctx approach (the full fio output is
also attached):

    slat (usec): min=70, max=598, avg=78.35, stdev= 7.76

And with atomics:

    slat (usec): min=24, max=491, avg=32.51, stdev= 6.47

Note the ~46us difference in the average submission latency. I used
ftrace to measure the amount of time spent in the path selector
functions, and the results confirm that essentially all of this
additional latency is due to the path selector. For the hctx looping
approach:

# grep 'nvme_mpath_start_request' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.20338877564886418
# grep 'nvme_mpath_end_request' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.5282572360375025
# grep 'nvme_find_path' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
46.14676273950285

vs atomics:

# grep 'nvme_mpath_start_request' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.20764315633022504
# grep 'nvme_mpath_end_request' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.5173406296446426
# grep 'nvme_find_path' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.4562795568166467

I don't know if this is due to the "using xarray on 'small' arrays is
horrible for performance" that Hannes mentioned. Maybe reverting that
patch would help things, but I still prefer the atomics approach for its
simplicity and the fact that the data does not indicate that the two new
RMW ops per I/O are a source of issues. If the contention is still
considered a problem, we can "split" the atomic into pieces along:

- namespace boundaries, so that the atomic lives in struct nvme_ns
  instead of struct nvme_ctrl. if different CPUs are doing I/O to
  different namespaces (which may be a common access pattern), this will
  reduce the contention on the atomic. this would give us pretty much a
  1:1 translation of the queue-length path selector from dm.
- hctx boundaries - when we calculate the queue depth, instead of
  summing the depths for every path's hctxs, only consider the hctx
  associated to the local CPU.
- CPU boundaries - make the queue depth variable fully per-CPU. this
  should be equivalent to the previous approach in the case where the
  host can create one hctx per CPU.

Of course, all of these come with the disadvantage of sharing less
information about path quality. But that's what it boils down to -
sharing of rapidly-changing information between CPUs is expensive.

We can also consider "standard" optimizations around the atomic(s), like
ensuring each one gets its own cache line.

---

fio jobfile:

[global]
ioengine=libaio
randrepeat=1
direct=1
group_reporting
blocksize=512k
iodepth=1
numjobs=1
runtime=15
time_based
rw=randread

[JOB1]
cpus_allowed=1
filename=/dev/nvme4n1

full fio results with hctx-looping approach:

JOB1: (g=0): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
fio-3.19
Starting 1 process

JOB1: (groupid=0, jobs=1): err= 0: pid=185197: Mon Oct  2 15:22:39 2023
  read: IOPS=490, BW=245MiB/s (257MB/s)(3680MiB/15001msec)
    slat (usec): min=70, max=598, avg=78.35, stdev= 7.76
    clat (usec): min=1662, max=2743, avg=1958.35, stdev=135.18
     lat (usec): min=1739, max=2901, avg=2036.92, stdev=135.45
    clat percentiles (usec):
     |  1.00th=[ 1729],  5.00th=[ 1778], 10.00th=[ 1811], 20.00th=[ 1844],
     | 30.00th=[ 1876], 40.00th=[ 1909], 50.00th=[ 1942], 60.00th=[ 1975],
     | 70.00th=[ 2008], 80.00th=[ 2057], 90.00th=[ 2147], 95.00th=[ 2212],
     | 99.00th=[ 2376], 99.50th=[ 2442], 99.90th=[ 2540], 99.95th=[ 2671],
     | 99.99th=[ 2737]
   bw (  KiB/s): min=248832, max=254976, per=100.00%, avg=251490.55, stdev=1468.84, samples=29
   iops        : min=  486, max=  498, avg=491.17, stdev= 2.90, samples=29
  lat (msec)   : 2=68.68%, 4=31.32%
  cpu          : usr=0.19%, sys=4.08%, ctx=7359, majf=0, minf=138
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7359,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=245MiB/s (257MB/s), 245MiB/s-245MiB/s (257MB/s-257MB/s), io=3680MiB (3858MB), run=15001-15001msec

Disk stats (read/write):
  nvme4n1: ios=7274/0, merge=0/0, ticks=14134/0, in_queue=14134, util=99.44%

full fio results with atomics approach:

JOB1: (g=0): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
fio-3.19
Starting 1 process

JOB1: (groupid=0, jobs=1): err= 0: pid=185287: Mon Oct  2 15:23:03 2023
  read: IOPS=493, BW=247MiB/s (259MB/s)(3701MiB/15002msec)
    slat (usec): min=24, max=491, avg=32.51, stdev= 6.47
    clat (usec): min=1648, max=24772, avg=1992.77, stdev=329.21
     lat (usec): min=1679, max=24804, avg=2025.49, stdev=329.31
    clat percentiles (usec):
     |  1.00th=[ 1745],  5.00th=[ 1795], 10.00th=[ 1827], 20.00th=[ 1876],
     | 30.00th=[ 1909], 40.00th=[ 1942], 50.00th=[ 1975], 60.00th=[ 2008],
     | 70.00th=[ 2040], 80.00th=[ 2089], 90.00th=[ 2180], 95.00th=[ 2245],
     | 99.00th=[ 2409], 99.50th=[ 2474], 99.90th=[ 2606], 99.95th=[ 2868],
     | 99.99th=[24773]
   bw (  KiB/s): min=235520, max=260096, per=100.00%, avg=252927.10, stdev=4438.49, samples=29
   iops        : min=  460, max=  508, avg=493.97, stdev= 8.68, samples=29
  lat (msec)   : 2=60.26%, 4=39.70%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.15%, sys=1.91%, ctx=7401, majf=0, minf=139
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7401,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=247MiB/s (259MB/s), 247MiB/s-247MiB/s (259MB/s-259MB/s), io=3701MiB (3880MB), run=15002-15002msec

Disk stats (read/write):
  nvme4n1: ios=7316/0, merge=0/0, ticks=14470/0, in_queue=14470, util=99.24%

patch against Linux 6.5 used for testing. this needs polish and is buggy
when the iopolicy changes while I/O is outstanding, but I avoided
hitting the bug during experiments by not doing I/O while changing the
iopolicy:

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1743857e0b01..fbc65eefa017 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
 	return tag < tags->nr_reserved_tags;
 }
 
-static inline bool blk_mq_is_shared_tags(unsigned int flags)
-{
-	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
-}
-
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (data->rq_flags & RQF_SCHED_TAGS)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0a88d7bdc5e3..f71aa1e6f537 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,8 @@ MODULE_PARM_DESC(multipath,
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
+	[NVME_IOPOLICY_QD_ATOMIC]      = "queue-depth-atomic",
+	[NVME_IOPOLICY_QD_EWAN] = "queue-depth-ewan",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +31,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_NUMA;
 	else if (!strncmp(val, "round-robin", 11))
 		iopolicy = NVME_IOPOLICY_RR;
+	else if (!strncmp(val, "queue-depth-atomic", 18))
+		iopolicy = NVME_IOPOLICY_QD_ATOMIC;
+	else if (!strncmp(val, "queue-depth-ewan", 16))
+		iopolicy = NVME_IOPOLICY_QD_EWAN;
 	else
 		return -EINVAL;
 
@@ -126,6 +132,11 @@ void nvme_mpath_start_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
 	struct gendisk *disk = ns->head->disk;
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+
+	if (READ_ONCE(subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) {
+                atomic_inc(&ns->ctrl->nr_active);
+        }
 
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
 		return;
@@ -139,9 +150,15 @@ EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
 void nvme_mpath_end_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+
+	if (READ_ONCE(subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) {
+                atomic_dec(&ns->ctrl->nr_active);
+        }
 
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+
 	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
 			 blk_rq_bytes(rq) >> SECTOR_SHIFT,
 			 nvme_req(rq)->start_time);
@@ -329,13 +346,69 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 	return found;
 }
 
+static struct nvme_ns *nvme_queue_depth_atomic_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+	unsigned int depth;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		depth = atomic_read(&ns->ctrl->nr_active);
+
+		if (ns->ana_state == NVME_ANA_OPTIMIZED &&
+		    depth < min_depth_opt) {
+			min_depth_opt = depth;
+			best_opt = ns;
+		}
+
+		if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
+		    depth < min_depth_nonopt) {
+			min_depth_nonopt = depth;
+			best_nonopt = ns;
+		}
+	}
+
+	return best_opt ? best_opt : best_nonopt;
+}
+
+static struct nvme_ns *nvme_queue_depth_ewan_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+	unsigned int depth;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		depth = blk_mq_queue_nr_active(ns->queue);
+
+		if (ns->ana_state == NVME_ANA_OPTIMIZED &&
+		    depth < min_depth_opt) {
+			min_depth_opt = depth;
+			best_opt = ns;
+		}
+
+		if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
+		    depth < min_depth_nonopt) {
+			min_depth_nonopt = depth;
+			best_nonopt = ns;
+		}
+	}
+
+	return best_opt ? best_opt : best_nonopt;
+}
+
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return ns->ctrl->state == NVME_CTRL_LIVE &&
 		ns->ana_state == NVME_ANA_OPTIMIZED;
 }
 
-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+noinline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
 	int node = numa_node_id();
 	struct nvme_ns *ns;
@@ -346,6 +419,11 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 
 	if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
 		return nvme_round_robin_path(head, node, ns);
+	else if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC)
+		return nvme_queue_depth_atomic_path(head);
+	else if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_QD_EWAN)
+		return nvme_queue_depth_ewan_path(head);
+
 	if (unlikely(!nvme_path_is_optimized(ns)))
 		return __nvme_find_path(head, node);
 	return ns;
@@ -900,6 +978,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 
 void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
 {
+	atomic_set(&ctrl->nr_active, 0);
 	mutex_init(&ctrl->ana_lock);
 	timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
 	INIT_WORK(&ctrl->ana_work, nvme_ana_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f35647c470af..a2e5d09d697d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -342,6 +342,7 @@ struct nvme_ctrl {
 	u8 anatt;
 	u32 anagrpmax;
 	u32 nanagrpid;
+	atomic_t nr_active;
 	struct mutex ana_lock;
 	struct nvme_ana_rsp_hdr *ana_log_buf;
 	size_t ana_log_size;
@@ -389,6 +390,8 @@ struct nvme_ctrl {
 enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
+	NVME_IOPOLICY_QD_ATOMIC,
+	NVME_IOPOLICY_QD_EWAN,
 };
 
 struct nvme_subsystem {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 495ca198775f..a7a32a1fd16c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -718,6 +718,33 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
 
 bool blk_mq_queue_inflight(struct request_queue *q);
 
+#define queue_for_each_hw_ctx(q, hctx, i)				\
+	xa_for_each(&(q)->hctx_table, (i), (hctx))
+
+#define hctx_for_each_ctx(hctx, ctx, i)					\
+	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
+	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
+
+static inline bool blk_mq_is_shared_tags(unsigned int flags)
+{
+	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
+static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
+{
+	unsigned int nr_active = 0;
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
+			return atomic_read(&q->nr_active_requests_shared_tags);
+		nr_active += atomic_read(&hctx->nr_active);
+	}
+	return nr_active;
+}
+
+
 enum {
 	/* return when out of requests */
 	BLK_MQ_REQ_NOWAIT	= (__force blk_mq_req_flags_t)(1 << 0),
@@ -943,13 +970,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
 	return rq + 1;
 }
 
-#define queue_for_each_hw_ctx(q, hctx, i)				\
-	xa_for_each(&(q)->hctx_table, (i), (hctx))
-
-#define hctx_for_each_ctx(hctx, ctx, i)					\
-	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
-	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
-
 static inline void blk_mq_cleanup_rq(struct request *rq)
 {
 	if (rq->q->mq_ops->cleanup_rq)



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-10-03 20:11       ` Uday Shankar
@ 2023-10-04  9:19         ` Sagi Grimberg
  0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2023-10-04  9:19 UTC (permalink / raw)
  To: Uday Shankar, Ewan Milne
  Cc: Hannes Reinecke, linux-nvme, tsong, jmeneghi, mlombard,
	Thomas Song, Randy Jennings, Prabhath Sajeepa


> I don't know if this is due to the "using xarray on 'small' arrays is
> horrible for performance" that Hannes mentioned. Maybe reverting that
> patch would help things, but I still prefer the atomics approach for its
> simplicity and the fact that the data does not indicate that the two new
> RMW ops per I/O are a source of issues. If the contention is still
> considered a problem, we can "split" the atomic into pieces along:
> 
> - namespace boundaries, so that the atomic lives in struct nvme_ns
>    instead of struct nvme_ctrl. if different CPUs are doing I/O to
>    different namespaces (which may be a common access pattern), this will
>    reduce the contention on the atomic. this would give us pretty much a
>    1:1 translation of the queue-length path selector from dm.
> - hctx boundaries - when we calculate the queue depth, instead of
>    summing the depths for every path's hctxs, only consider the hctx
>    associated to the local CPU.

I think that this is the best approach. It is matching the existing
synchronization boundary in the io-path today.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
  2023-09-25 20:56   ` Bart Van Assche
  2023-09-27  7:36   ` Hannes Reinecke
@ 2023-09-27  9:49   ` Ming Lei
  2023-09-27 13:54     ` Ewan Milne
  2023-09-27 10:56   ` Sagi Grimberg
  3 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2023-09-27  9:49 UTC (permalink / raw)
  To: Ewan D. Milne; +Cc: linux-nvme, tsong, jmeneghi, mlombard

On Mon, Sep 25, 2023 at 12:31:21PM -0400, Ewan D. Milne wrote:
> Returns a count of the total number of active requests
> in a queue.  For non-shared tags (the usual case) this is
> the sum of nr_active from all of the hctxs.

hctx->nr_active is only updated if BLK_MQ_F_TAG_QUEUE_SHARED is true
and blk_mq_is_shared_tags() is false, so hctx->nr_active isn't
available if BLK_MQ_F_TAG_QUEUE_SHARED is false(the usual case?).


Thanks,
Ming



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27  9:49   ` Ming Lei
@ 2023-09-27 13:54     ` Ewan Milne
  0 siblings, 0 replies; 24+ messages in thread
From: Ewan Milne @ 2023-09-27 13:54 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-nvme, tsong, jmeneghi, mlombard

I instrumented the code to look at nr_active and it did seem to be
updated for all the
nvme fabrics transports I tested, but I'll check.  (As a blk_ ..
function it would have to
work on anything).

Of course the patches depend upon the block layer having the nr_active
count available
somewhere.  I did it this way to avoid having to duplicate the count
in the nvme code.

-Ewan

On Wed, Sep 27, 2023 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:31:21PM -0400, Ewan D. Milne wrote:
> > Returns a count of the total number of active requests
> > in a queue.  For non-shared tags (the usual case) this is
> > the sum of nr_active from all of the hctxs.
>
> hctx->nr_active is only updated if BLK_MQ_F_TAG_QUEUE_SHARED is true
> and blk_mq_is_shared_tags() is false, so hctx->nr_active isn't
> available if BLK_MQ_F_TAG_QUEUE_SHARED is false(the usual case?).
>
>
> Thanks,
> Ming
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
                     ` (2 preceding siblings ...)
  2023-09-27  9:49   ` Ming Lei
@ 2023-09-27 10:56   ` Sagi Grimberg
  2023-09-27 13:50     ` Ewan Milne
  3 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2023-09-27 10:56 UTC (permalink / raw)
  To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard



On 9/25/23 19:31, Ewan D. Milne wrote:
> Returns a count of the total number of active requests
> in a queue.  For non-shared tags (the usual case) this is
> the sum of nr_active from all of the hctxs.
> 
> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>   block/blk-mq.h         |  5 -----
>   include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++-------
>   2 files changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 1743857e0b01..fbc65eefa017 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
>   	return tag < tags->nr_reserved_tags;
>   }
>   
> -static inline bool blk_mq_is_shared_tags(unsigned int flags)
> -{
> -	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> -}
> -
>   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
>   {
>   	if (data->rq_flags & RQF_SCHED_TAGS)
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 01e8c31db665..c921ae5236ab 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>   
>   bool blk_mq_queue_inflight(struct request_queue *q);
>   
> +#define queue_for_each_hw_ctx(q, hctx, i)				\
> +	xa_for_each(&(q)->hctx_table, (i), (hctx))
> +
> +#define hctx_for_each_ctx(hctx, ctx, i)					\
> +	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
> +	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> +
> +static inline bool blk_mq_is_shared_tags(unsigned int flags)
> +{
> +	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> +}
> +
> +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> +{
> +	unsigned int nr_active = 0;
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned long i;
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> +			return atomic_read(&q->nr_active_requests_shared_tags);
> +		nr_active += atomic_read(&hctx->nr_active);
> +	}

I think that for the purposes of nvme-mpath you should probably be
interested in the hctx mapped to the running cpu, and not the
cumulative active requests.

As for BLK_MQ_F_TAG_HCTX_SHARED, this is dead-code until scsi/null makes
any use of it... but seems fine in theory.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27 10:56   ` Sagi Grimberg
@ 2023-09-27 13:50     ` Ewan Milne
  2023-09-28 10:56       ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Ewan Milne @ 2023-09-27 13:50 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, tsong, jmeneghi, mlombard

I think it is unfortunately necessary to compute the sum from all the hctxs
because in the general case there could be threads on other CPUs submitting
I/O through another hctx that might significantly affect the result, but I can
do some more tests and see how it looks.

-Ewan

On Wed, Sep 27, 2023 at 6:56 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 9/25/23 19:31, Ewan D. Milne wrote:
> > Returns a count of the total number of active requests
> > in a queue.  For non-shared tags (the usual case) this is
> > the sum of nr_active from all of the hctxs.
> >
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >   block/blk-mq.h         |  5 -----
> >   include/linux/blk-mq.h | 33 ++++++++++++++++++++++++++-------
> >   2 files changed, 26 insertions(+), 12 deletions(-)
> >
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 1743857e0b01..fbc65eefa017 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
> >       return tag < tags->nr_reserved_tags;
> >   }
> >
> > -static inline bool blk_mq_is_shared_tags(unsigned int flags)
> > -{
> > -     return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> > -}
> > -
> >   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
> >   {
> >       if (data->rq_flags & RQF_SCHED_TAGS)
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 01e8c31db665..c921ae5236ab 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -716,6 +716,32 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >
> >   bool blk_mq_queue_inflight(struct request_queue *q);
> >
> > +#define queue_for_each_hw_ctx(q, hctx, i)                            \
> > +     xa_for_each(&(q)->hctx_table, (i), (hctx))
> > +
> > +#define hctx_for_each_ctx(hctx, ctx, i)                                      \
> > +     for ((i) = 0; (i) < (hctx)->nr_ctx &&                           \
> > +          ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
> > +
> > +static inline bool blk_mq_is_shared_tags(unsigned int flags)
> > +{
> > +     return flags & BLK_MQ_F_TAG_HCTX_SHARED;
> > +}
> > +
> > +static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
> > +{
> > +     unsigned int nr_active = 0;
> > +     struct blk_mq_hw_ctx *hctx;
> > +     unsigned long i;
> > +
> > +     queue_for_each_hw_ctx(q, hctx, i) {
> > +             if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
> > +                     return atomic_read(&q->nr_active_requests_shared_tags);
> > +             nr_active += atomic_read(&hctx->nr_active);
> > +     }
>
> I think that for the purposes of nvme-mpath you should probably be
> interested in the hctx mapped to the running cpu, and not the
> cumulative active requests.
>
> As for BLK_MQ_F_TAG_HCTX_SHARED, this is dead-code until scsi/null makes
> any use of it... but seems fine in theory.
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] block: introduce blk_queue_nr_active()
  2023-09-27 13:50     ` Ewan Milne
@ 2023-09-28 10:56       ` Sagi Grimberg
  0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2023-09-28 10:56 UTC (permalink / raw)
  To: Ewan Milne; +Cc: linux-nvme, tsong, jmeneghi, mlombard

> I think it is unfortunately necessary to compute the sum from all the hctxs
> because in the general case there could be threads on other CPUs submitting
> I/O through another hctx that might significantly affect the result,

This argument is true in general, other threads issuing I/O on different
contexts may affect each other as you never know how a controller
processes the submission queues.

Its a tradeoff between the level of inspection and a reasonable
overhead. I do not think iterating over all hctxs of all available paths
is reasonable. But lets wait for your testing.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"
  2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
@ 2023-09-25 16:31 ` Ewan D. Milne
  2023-09-27  7:38   ` Hannes Reinecke
  2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne
  2 siblings, 1 reply; 24+ messages in thread
From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw)
  To: linux-nvme; +Cc: tsong, jmeneghi, mlombard

The existing iopolicies are inefficient in some cases, such as
the presence of a path with high latency. The round-robin
policy would use that path equally with faster paths, which
results in sub-optimal performance.

The queue-depth policy instead sends I/O requests down the path
with the least amount of requests in its request queue. Paths
with lower latency will clear requests more quickly and have less
requests in their queues compared to "bad" paths. The aim is to
use those paths the most to bring down overall latency.

[edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
      and compilation warnings, updated MODULE_PARM description
      and changed to use new blk_mq_queue_nr_active() instead of adding
      additional atomic counters.]

Co-developed-by: Thomas Song <tsong@purestorage.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
---
 drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0a88d7bdc5e3..deea7fd4aa95 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
+	[NVME_IOPOLICY_QD]      = "queue-depth",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_NUMA;
 	else if (!strncmp(val, "round-robin", 11))
 		iopolicy = NVME_IOPOLICY_RR;
+	else if (!strncmp(val, "queue-depth", 11))
+		iopolicy = NVME_IOPOLICY_QD;
 	else
 		return -EINVAL;
 
@@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
-	"Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
 
 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
@@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 	return found;
 }
 
+static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
+		int node)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+	unsigned int depth;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		depth = blk_mq_queue_nr_active(ns->queue);
+
+		switch (ns->ana_state) {
+		case NVME_ANA_OPTIMIZED:
+			if (depth < min_depth_opt) {
+				min_depth_opt = depth;
+				best_opt = ns;
+			}
+			break;
+
+		case NVME_ANA_NONOPTIMIZED:
+			if (depth < min_depth_nonopt) {
+				min_depth_nonopt = depth;
+				best_nonopt = ns;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	return best_opt ? best_opt : best_nonopt;
+}
+
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return ns->ctrl->state == NVME_CTRL_LIVE &&
@@ -344,8 +382,14 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 	if (unlikely(!ns))
 		return __nvme_find_path(head, node);
 
-	if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
+	switch(READ_ONCE(head->subsys->iopolicy)) {
+	case NVME_IOPOLICY_RR:
 		return nvme_round_robin_path(head, node, ns);
+
+	case NVME_IOPOLICY_QD:
+		return nvme_queue_depth_path(head, node);
+	}
+
 	if (unlikely(!nvme_path_is_optimized(ns)))
 		return __nvme_find_path(head, node);
 	return ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f35647c470af..938fbed007ce 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -389,6 +389,7 @@ struct nvme_ctrl {
 enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
+	NVME_IOPOLICY_QD,
 };
 
 struct nvme_subsystem {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"
  2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne
@ 2023-09-27  7:38   ` Hannes Reinecke
  2023-09-27 11:02     ` Sagi Grimberg
  2023-09-27 13:42     ` Ewan Milne
  0 siblings, 2 replies; 24+ messages in thread
From: Hannes Reinecke @ 2023-09-27  7:38 UTC (permalink / raw)
  To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard

On 9/25/23 18:31, Ewan D. Milne wrote:
> The existing iopolicies are inefficient in some cases, such as
> the presence of a path with high latency. The round-robin
> policy would use that path equally with faster paths, which
> results in sub-optimal performance.
> 
> The queue-depth policy instead sends I/O requests down the path
> with the least amount of requests in its request queue. Paths
> with lower latency will clear requests more quickly and have less
> requests in their queues compared to "bad" paths. The aim is to
> use those paths the most to bring down overall latency.
> 
> [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>        and compilation warnings, updated MODULE_PARM description
>        and changed to use new blk_mq_queue_nr_active() instead of adding
>        additional atomic counters.]
> 
> Co-developed-by: Thomas Song <tsong@purestorage.com>
> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>   drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
>   drivers/nvme/host/nvme.h      |  1 +
>   2 files changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 0a88d7bdc5e3..deea7fd4aa95 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>   static const char *nvme_iopolicy_names[] = {
>   	[NVME_IOPOLICY_NUMA]	= "numa",
>   	[NVME_IOPOLICY_RR]	= "round-robin",
> +	[NVME_IOPOLICY_QD]      = "queue-depth",
>   };
>   
>   static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>   		iopolicy = NVME_IOPOLICY_NUMA;
>   	else if (!strncmp(val, "round-robin", 11))
>   		iopolicy = NVME_IOPOLICY_RR;
> +	else if (!strncmp(val, "queue-depth", 11))
> +		iopolicy = NVME_IOPOLICY_QD;
>   	else
>   		return -EINVAL;
>   
> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>   	&iopolicy, 0644);
>   MODULE_PARM_DESC(iopolicy,
> -	"Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> +	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>   
>   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>   {
> @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>   	return found;
>   }
>   
> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
> +		int node)
> +{
> +	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> +	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> +	unsigned int depth;
> +
> +	list_for_each_entry_rcu(ns, &head->list, siblings) {
> +		if (nvme_path_is_disabled(ns))
> +			continue;
> +
> +		depth = blk_mq_queue_nr_active(ns->queue);

That would require an 'atomic_read' on each path for each I/O.
Which absolutely sucks for performance.

Considering that we really are only interested in a rough idea about 
path usage, wouldn't it be better to use READ_ONCE()/WRITE_ONCE()
for accumulating the nr of requests per queue?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"
  2023-09-27  7:38   ` Hannes Reinecke
@ 2023-09-27 11:02     ` Sagi Grimberg
  2023-09-27 13:42     ` Ewan Milne
  1 sibling, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2023-09-27 11:02 UTC (permalink / raw)
  To: Hannes Reinecke, Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard


>> The existing iopolicies are inefficient in some cases, such as
>> the presence of a path with high latency. The round-robin
>> policy would use that path equally with faster paths, which
>> results in sub-optimal performance.
>>
>> The queue-depth policy instead sends I/O requests down the path
>> with the least amount of requests in its request queue. Paths
>> with lower latency will clear requests more quickly and have less
>> requests in their queues compared to "bad" paths. The aim is to
>> use those paths the most to bring down overall latency.
>>
>> [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>>        and compilation warnings, updated MODULE_PARM description
>>        and changed to use new blk_mq_queue_nr_active() instead of adding
>>        additional atomic counters.]
>>
>> Co-developed-by: Thomas Song <tsong@purestorage.com>
>> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
>> ---
>>   drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
>>   drivers/nvme/host/nvme.h      |  1 +
>>   2 files changed, 47 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index 0a88d7bdc5e3..deea7fd4aa95 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>>   static const char *nvme_iopolicy_names[] = {
>>       [NVME_IOPOLICY_NUMA]    = "numa",
>>       [NVME_IOPOLICY_RR]    = "round-robin",
>> +    [NVME_IOPOLICY_QD]      = "queue-depth",
>>   };
>>   static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const 
>> struct kernel_param *kp)
>>           iopolicy = NVME_IOPOLICY_NUMA;
>>       else if (!strncmp(val, "round-robin", 11))
>>           iopolicy = NVME_IOPOLICY_RR;
>> +    else if (!strncmp(val, "queue-depth", 11))
>> +        iopolicy = NVME_IOPOLICY_QD;
>>       else
>>           return -EINVAL;
>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct 
>> kernel_param *kp)
>>   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>       &iopolicy, 0644);
>>   MODULE_PARM_DESC(iopolicy,
>> -    "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>> +    "Default multipath I/O policy; 'numa' (default) , 'round-robin' 
>> or 'queue-depth'");
>>   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>   {
>> @@ -329,6 +332,41 @@ static struct nvme_ns 
>> *nvme_round_robin_path(struct nvme_ns_head *head,
>>       return found;
>>   }
>> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
>> +        int node)
>> +{
>> +    struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>> +    unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
>> +    unsigned int depth;
>> +
>> +    list_for_each_entry_rcu(ns, &head->list, siblings) {
>> +        if (nvme_path_is_disabled(ns))
>> +            continue;
>> +
>> +        depth = blk_mq_queue_nr_active(ns->queue);
> 
> That would require an 'atomic_read' on each path for each I/O.
> Which absolutely sucks for performance.

Actually in the current implementation its for every hctx in every
path...

> Considering that we really are only interested in a rough idea about 
> path usage,

I am not sure that a "rough idea" is sufficient here.

> wouldn't it be better to use READ_ONCE()/WRITE_ONCE()
> for accumulating the nr of requests per queue?

I'm assuming that WRITE_ONCE() is a slip up because the path
selector should absolutely not update the hctx active count.

I think its unnatural to access atomic variables non-atomically, and at 
the very least requires a code comment to why this is ok. But regardless
we need some testing to tell us if this is OK to do.

This only matters when there are less hctx than running cpus, but that
is a probable case for fabrics.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"
  2023-09-27  7:38   ` Hannes Reinecke
  2023-09-27 11:02     ` Sagi Grimberg
@ 2023-09-27 13:42     ` Ewan Milne
  1 sibling, 0 replies; 24+ messages in thread
From: Ewan Milne @ 2023-09-27 13:42 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-nvme, tsong, jmeneghi, mlombard

Obviously having to look at the number of active requests on all paths
for each I/O has a performance cost, but the tradeoff is that you pay
a CPU cost in order to get better path distribution/utilization and hopefully
lower I/O latency.  And, you only pay this cost if queue-depth is used.
Otherwise the patches barely touch the code path.

I am not an expert in modern CPU cache architecture, but the updates
to the atomic counters on the request/completion are all per-hctx, the
path selector only does a read, which I think is a simple (no LOCK prefix)
read on x86.  The cost would be in the CPU touching the cacheline.

I tried some alternatives that did not always get the most current value of
nr_active for *each* I/O but didn't get suitable behavior.  I/O requests can
be submitted very quickly, a lazy computation may not adapt fast enough.

-Ewan

On Wed, Sep 27, 2023 at 3:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 9/25/23 18:31, Ewan D. Milne wrote:
> > The existing iopolicies are inefficient in some cases, such as
> > the presence of a path with high latency. The round-robin
> > policy would use that path equally with faster paths, which
> > results in sub-optimal performance.
> >
> > The queue-depth policy instead sends I/O requests down the path
> > with the least amount of requests in its request queue. Paths
> > with lower latency will clear requests more quickly and have less
> > requests in their queues compared to "bad" paths. The aim is to
> > use those paths the most to bring down overall latency.
> >
> > [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> >        and compilation warnings, updated MODULE_PARM description
> >        and changed to use new blk_mq_queue_nr_active() instead of adding
> >        additional atomic counters.]
> >
> > Co-developed-by: Thomas Song <tsong@purestorage.com>
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >   drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
> >   drivers/nvme/host/nvme.h      |  1 +
> >   2 files changed, 47 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > index 0a88d7bdc5e3..deea7fd4aa95 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> >   static const char *nvme_iopolicy_names[] = {
> >       [NVME_IOPOLICY_NUMA]    = "numa",
> >       [NVME_IOPOLICY_RR]      = "round-robin",
> > +     [NVME_IOPOLICY_QD]      = "queue-depth",
> >   };
> >
> >   static int iopolicy = NVME_IOPOLICY_NUMA;
> > @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> >               iopolicy = NVME_IOPOLICY_NUMA;
> >       else if (!strncmp(val, "round-robin", 11))
> >               iopolicy = NVME_IOPOLICY_RR;
> > +     else if (!strncmp(val, "queue-depth", 11))
> > +             iopolicy = NVME_IOPOLICY_QD;
> >       else
> >               return -EINVAL;
> >
> > @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> >   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> >       &iopolicy, 0644);
> >   MODULE_PARM_DESC(iopolicy,
> > -     "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> > +     "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
> >
> >   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> >   {
> > @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> >       return found;
> >   }
> >
> > +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
> > +             int node)
> > +{
> > +     struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> > +     unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> > +     unsigned int depth;
> > +
> > +     list_for_each_entry_rcu(ns, &head->list, siblings) {
> > +             if (nvme_path_is_disabled(ns))
> > +                     continue;
> > +
> > +             depth = blk_mq_queue_nr_active(ns->queue);
>
> That would require an 'atomic_read' on each path for each I/O.
> Which absolutely sucks for performance.
>
> Considering that we really are only interested in a rough idea about
> path usage, wouldn't it be better to use READ_ONCE()/WRITE_ONCE()
> for accumulating the nr of requests per queue?
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option
  2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne
  2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
  2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne
@ 2023-09-25 16:31 ` Ewan D. Milne
  2023-09-27  7:41   ` Hannes Reinecke
  2 siblings, 1 reply; 24+ messages in thread
From: Ewan D. Milne @ 2023-09-25 16:31 UTC (permalink / raw)
  To: linux-nvme; +Cc: tsong, jmeneghi, mlombard

Setting nvme_core.use_nonoptimized=true will cause the path
selector to treat optimized and nonoptimized paths equally.

This is because although an NVMe fabrics target device may report
an unoptimized ANA state, it is possible that other factors such
as fabric latency are a large factor in the I/O service time.  And,
throughput may improve overall if nonoptimized ports are also used.

Signed-off-by: Ewan D. Milne <emilne@redhat.com>
---
 drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index deea7fd4aa95..f640b8ad934b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -13,6 +13,10 @@ bool multipath = true;
 module_param(multipath, bool, 0444);
 MODULE_PARM_DESC(multipath,
 	"turn on native support for multiple controllers per subsystem");
+bool use_nonoptimized = false;
+module_param(use_nonoptimized, bool, 0644);
+MODULE_PARM_DESC(use_nonoptimized,
+	"Use ANA nonoptimized paths as well as optimized paths for I/O");
 
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
@@ -272,7 +276,10 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 		}
 	}
 
-	if (!found)
+	/*
+	 * If fallback_distance was set, fallback must also be set
+	 */
+	if (!found || (use_nonoptimized && (fallback_distance < found_distance)))
 		found = fallback;
 	if (found)
 		rcu_assign_pointer(head->current_path[node], found);
@@ -310,8 +317,12 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 			found = ns;
 			goto out;
 		}
-		if (ns->ana_state == NVME_ANA_NONOPTIMIZED)
+		if (ns->ana_state == NVME_ANA_NONOPTIMIZED) {
 			found = ns;
+			if (use_nonoptimized)
+				goto out;
+		}
+
 	}
 
 	/*
@@ -321,7 +332,7 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 	 *  - no other usable path found and current is usable.
 	 */
 	if (!nvme_path_is_disabled(old) &&
-	    (old->ana_state == NVME_ANA_OPTIMIZED ||
+	    (old->ana_state == NVME_ANA_OPTIMIZED || use_nonoptimized ||
 	     (!found && old->ana_state == NVME_ANA_NONOPTIMIZED)))
 		return old;
 
@@ -364,6 +375,11 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
 		}
 	}
 
+	/*
+	 * If min_depth_nonopt was set, best_nonopt must also be set
+	 */
+	if (use_nonoptimized && (min_depth_nonopt < min_depth_opt))
+		return best_nonopt;
 	return best_opt ? best_opt : best_nonopt;
 }
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option
  2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne
@ 2023-09-27  7:41   ` Hannes Reinecke
  2023-09-27 11:31     ` Sagi Grimberg
  2023-09-27 13:45     ` Ewan Milne
  0 siblings, 2 replies; 24+ messages in thread
From: Hannes Reinecke @ 2023-09-27  7:41 UTC (permalink / raw)
  To: Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard

On 9/25/23 18:31, Ewan D. Milne wrote:
> Setting nvme_core.use_nonoptimized=true will cause the path
> selector to treat optimized and nonoptimized paths equally.
> 
> This is because although an NVMe fabrics target device may report
> an unoptimized ANA state, it is possible that other factors such
> as fabric latency are a large factor in the I/O service time.  And,
> throughput may improve overall if nonoptimized ports are also used.
> 
> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>   drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
>   1 file changed, 19 insertions(+), 3 deletions(-)
> 
No. Please don't.

There's a reason why controllers specify paths as 'active/optimized' or 
'active/non-optimized'. If they had wanted us to use all paths they 
would have put them into the same group.
They tend to get very unhappy if you start using them at the same time.
(Triggering failover etc.)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option
  2023-09-27  7:41   ` Hannes Reinecke
@ 2023-09-27 11:31     ` Sagi Grimberg
  2023-09-27 13:11       ` John Meneghini
  2023-09-27 13:45     ` Ewan Milne
  1 sibling, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2023-09-27 11:31 UTC (permalink / raw)
  To: Hannes Reinecke, Ewan D. Milne, linux-nvme; +Cc: tsong, jmeneghi, mlombard


>> Setting nvme_core.use_nonoptimized=true will cause the path
>> selector to treat optimized and nonoptimized paths equally.
>>
>> This is because although an NVMe fabrics target device may report
>> an unoptimized ANA state, it is possible that other factors such
>> as fabric latency are a large factor in the I/O service time.  And,
>> throughput may improve overall if nonoptimized ports are also used.
>>
>> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
>> ---
>>   drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>
> No. Please don't.
> 
> There's a reason why controllers specify paths as 'active/optimized' or 
> 'active/non-optimized'. If they had wanted us to use all paths they 
> would have put them into the same group.
> They tend to get very unhappy if you start using them at the same time.
> (Triggering failover etc.)

I have to agree here. This is effectively a modparam that says
all paths are optimized regardless of what the controller reports.

While I do acknowledge that there may be some merit to use non-optimized
paths as well, but its almost impossible to know some latent optimum
path distribution. Hence the host forfeits even attempting.

If the controller wants all path used, it should make all paths
optimized and the host can examine QD accumulating on some paths
vs others.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option
  2023-09-27 11:31     ` Sagi Grimberg
@ 2023-09-27 13:11       ` John Meneghini
  0 siblings, 0 replies; 24+ messages in thread
From: John Meneghini @ 2023-09-27 13:11 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Ewan D. Milne, linux-nvme; +Cc: tsong, mlombard

Ewan I discussed this patch and agreed this is not something we want to go upstream. It was only included here for completeness. 
This patch was only used to increase the number of active paths during the testing of patches 01 and 02.  The test bed Ewan used 
originally had only 4 35Gbps nvme-tcp contollers (2 active optimized and 2 active non-optimized).  He used this patch to change 
the multi-pathing policy and enable the use all 4 controllers - resulting in 4 active paths.

Those test results can be seen here:

https://people.redhat.com/jmeneghi/.multipath/test1/A400-TCP-FIO-RR.ps - round-robin io tests
https://people.redhat.com/jmeneghi/.multipath/test1/A400-TCP-FIO-QD.ps - queue-depth io tests

After sending these patches upstream on Monday Ewan and I built a new test bed with 8 controllers - 4 active optimize paths and 
4 non-optimized paths.  This provided a true multi-path test bed and the use_nonoptimized patch wasn't needed.

In addition, this test bed has a mixed controller subsystem consisting of 4 32GB nvme-fc controllers and 4 100Gbps nvme-tcp 
controllers.  This provided the optimal mix of active optimized controllers paths with different transports that have inherently 
different latency characteristics.

  [root@rhel-storage-104 ~]# nvme list-subsys
nvme-subsys3 - NQN=nqn.1992-08.com.netapp:sn.2b82d9b13bb211ee8744d039ea989119:subsystem.SS104a
\
  +- nvme10 fc traddr=nn-0x2027d039ea98949e:pn-0x202cd039ea98949e,host_traddr=nn-0x200000109b9b7f0d:pn-0x100000109b9b7f0d live
  +- nvme11 fc traddr=nn-0x2027d039ea98949e:pn-0x2029d039ea98949e,host_traddr=nn-0x200000109b9b7f0c:pn-0x100000109b9b7f0c live
  +- nvme12 fc traddr=nn-0x2027d039ea98949e:pn-0x2028d039ea98949e,host_traddr=nn-0x200000109b9b7f0d:pn-0x100000109b9b7f0d live
  +- nvme13 tcp traddr=172.18.50.13,trsvcid=4420,src_addr=172.18.50.3 live
  +- nvme2 tcp traddr=172.18.60.16,trsvcid=4420,src_addr=172.18.60.4 live
  +- nvme3 fc traddr=nn-0x2027d039ea98949e:pn-0x202dd039ea98949e,host_traddr=nn-0x200000109b9b7f0c:pn-0x100000109b9b7f0c live
  +- nvme4 tcp traddr=172.18.50.15,trsvcid=4420,src_addr=172.18.50.3 live
  +- nvme9 tcp traddr=172.18.60.14,trsvcid=4420,src_addr=172.18.60.4 live

I shared the performance graphs for this testbed using these patches at the ALPSS conference today.

Graphs of inflight I/O on 4 Optimized paths (2 FC, 2 TCP) with round-robin:

https://people.redhat.com/jmeneghi/.multipath/test2/A400-TEST1-FIO-RR.ps

and queue-depth:

https://people.redhat.com/jmeneghi/.multipath/test2/A400-TEST1-FIO-QD.ps

Also a included is a graph showing the combined number of inflight I/Os on all paths, plotted with both RR and QD:

https://people.redhat.com/jmeneghi/.multipath/test2/A400-TEST1-FIO-MAX.ps

What we see in these graphs is that RR is only using 1 of the 4 possible paths, and with QD all 4 paths are used about equally 
while the maximum inflight I/Os count almost doubles.

Hope this helps.

John A. Meneghini
Senior Principal Platform Storage Engineer
RHEL SST - Platform Storage Group
jmeneghi@redhat.com

On 9/27/23 13:31, Sagi Grimberg wrote:
> 
>>> Setting nvme_core.use_nonoptimized=true will cause the path
>>> selector to treat optimized and nonoptimized paths equally.
>>>
>>> This is because although an NVMe fabrics target device may report
>>> an unoptimized ANA state, it is possible that other factors such
>>> as fabric latency are a large factor in the I/O service time.  And,
>>> throughput may improve overall if nonoptimized ports are also used.
>>>
>>> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
>>> ---
>>>   drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>> No. Please don't.
>>
>> There's a reason why controllers specify paths as 'active/optimized' or 'active/non-optimized'. If they had wanted us to use 
>> all paths they would have put them into the same group.
>> They tend to get very unhappy if you start using them at the same time.
>> (Triggering failover etc.)
> 
> I have to agree here. This is effectively a modparam that says
> all paths are optimized regardless of what the controller reports.
> 
> While I do acknowledge that there may be some merit to use non-optimized
> paths as well, but its almost impossible to know some latent optimum
> path distribution. Hence the host forfeits even attempting.
> 
> If the controller wants all path used, it should make all paths
> optimized and the host can examine QD accumulating on some paths
> vs others.
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option
  2023-09-27  7:41   ` Hannes Reinecke
  2023-09-27 11:31     ` Sagi Grimberg
@ 2023-09-27 13:45     ` Ewan Milne
  1 sibling, 0 replies; 24+ messages in thread
From: Ewan Milne @ 2023-09-27 13:45 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-nvme, tsong, jmeneghi, mlombard

Yes, you are probably right.

I added this patch because what I found was that, for arrays I was using,
the nonoptimized paths had almost the same characteristics as the optimized
paths for some workloads.  They had a small amount of additional latency.

And, it is helpful to have paths available with slightly different performance
in order to experiment with the behavior of different path selectors.

I'm OK with omitting this patch but I wanted people to see how it could be done.

-Ewan

On Wed, Sep 27, 2023 at 3:41 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 9/25/23 18:31, Ewan D. Milne wrote:
> > Setting nvme_core.use_nonoptimized=true will cause the path
> > selector to treat optimized and nonoptimized paths equally.
> >
> > This is because although an NVMe fabrics target device may report
> > an unoptimized ANA state, it is possible that other factors such
> > as fabric latency are a large factor in the I/O service time.  And,
> > throughput may improve overall if nonoptimized ports are also used.
> >
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >   drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
> >   1 file changed, 19 insertions(+), 3 deletions(-)
> >
> No. Please don't.
>
> There's a reason why controllers specify paths as 'active/optimized' or
> 'active/non-optimized'. If they had wanted us to use all paths they
> would have put them into the same group.
> They tend to get very unhappy if you start using them at the same time.
> (Triggering failover etc.)
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-10-04  9:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-25 16:31 [PATCH 0/3] NVMe multipath path selector enhancements Ewan D. Milne
2023-09-25 16:31 ` [PATCH 1/3] block: introduce blk_queue_nr_active() Ewan D. Milne
2023-09-25 20:56   ` Bart Van Assche
2023-09-27 14:49     ` Ewan Milne
2023-09-27 14:58       ` Bart Van Assche
2023-09-27  7:36   ` Hannes Reinecke
2023-09-27 11:37     ` Ming Lei
2023-09-27 13:34     ` Ewan Milne
2023-10-03 20:11       ` Uday Shankar
2023-10-04  9:19         ` Sagi Grimberg
2023-09-27  9:49   ` Ming Lei
2023-09-27 13:54     ` Ewan Milne
2023-09-27 10:56   ` Sagi Grimberg
2023-09-27 13:50     ` Ewan Milne
2023-09-28 10:56       ` Sagi Grimberg
2023-09-25 16:31 ` [PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth" Ewan D. Milne
2023-09-27  7:38   ` Hannes Reinecke
2023-09-27 11:02     ` Sagi Grimberg
2023-09-27 13:42     ` Ewan Milne
2023-09-25 16:31 ` [PATCH 3/3] nvme-multipath: add "use_nonoptimized" module option Ewan D. Milne
2023-09-27  7:41   ` Hannes Reinecke
2023-09-27 11:31     ` Sagi Grimberg
2023-09-27 13:11       ` John Meneghini
2023-09-27 13:45     ` Ewan Milne

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox