linux-block.vger.kernel.org archive mirror
* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
       [not found] ` <20160916210448.GA1178@localhost.localdomain>
@ 2016-09-19 10:38   ` Alexander Gordeev
  2016-09-19 13:33     ` Bart Van Assche
  2016-09-20 15:00     ` Keith Busch
  0 siblings, 2 replies; 3+ messages in thread
From: Alexander Gordeev @ 2016-09-19 10:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-kernel, Jens Axboe, linux-nvme, linux-block

On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:

CC-ing linux-block@vger.kernel.org

> I'm not sure I see how this helps. That probably means I'm not considering
> the right scenario. Could you elaborate on when having multiple hardware
> queues to choose from for a given CPU will provide a benefit?

No, I do not keep in mind any particular scenario besides common
sense. Just an assumption that deeper queues are better (in this RFC
a virtual combined queue consisting of multiple h/w queues).

Apparently, there could be positive effects only in systems where
# of queues / # of CPUs > 1 or # of queues / # of cores > 1. But
I do not happen to have such a system. If I had numbers this would
not be an RFC and I probably would not have posted it in the first place ;)

Would it be possible to give it a try on your hardware?

> If we're out of available h/w tags, having more queues shouldn't
> improve performance. The tag depth on each nvme hw context is already
> deep enough that it should mean even one full queue has saturated the
> device capabilities.

Am I getting you right that a single full NVMe hardware queue stalls
the other queues?

> Having a 1:1 already seemed like the ideal solution since you can't
> simultaneously utilize more than that from the host, so there's no more
> h/w parallelism we can exploit. On the controller side, fetching
> commands is serialized memory reads, so I don't think spreading IO
> among more h/w queues helps the target over posting more commands to a
> single queue.

I take note of the unordered command completion you describe below.
But I fail to realize why a CPU would not simultaneously utilize more
than one queue by posting to multiple. Is it due to NVMe specifics or
do you assume the host would not issue that many commands?

Besides, blk-mq-tag re-uses the latest freed tag, so IO should not
actually get spread. Instead, the next available queue is chosen only
if the currently used hardware queue is full. But this is speculation
without real benchmarks, of course.

> If a CPU has more than one to choose from, a command sent to a less
> used queue would be serviced ahead of previously issued commands on a
> more heavily used one from the same CPU thread due to how NVMe command
> arbitration works, so it sounds like this would create odd latency
> outliers.

Yep, that sounds scary indeed. Still, any hints on benchmarking
are welcome.

Many thanks!

> Thanks,
> Keith


* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-19 10:38   ` [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues Alexander Gordeev
@ 2016-09-19 13:33     ` Bart Van Assche
  2016-09-20 15:00     ` Keith Busch
  1 sibling, 0 replies; 3+ messages in thread
From: Bart Van Assche @ 2016-09-19 13:33 UTC (permalink / raw)
  To: Alexander Gordeev, Keith Busch
  Cc: linux-kernel@vger.kernel.org, Jens Axboe,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org

On 09/19/16 03:38, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
>
> CC-ing linux-block@vger.kernel.org
>
>> I'm not sure I see how this helps. That probably means I'm not considering
>> the right scenario. Could you elaborate on when having multiple hardware
>> queues to choose from for a given CPU will provide a benefit?
>
> No, I do not keep in mind any particular scenario besides common
> sense. Just an assumption that deeper queues are better (in this RFC
> a virtual combined queue consisting of multiple h/w queues).
>
> Apparently, there could be positive effects only in systems where
> # of queues / # of CPUs > 1 or # of queues / # of cores > 1. But
> I do not happen to have such a system. If I had numbers this would
> not be an RFC and I probably would not have posted it in the first place ;)
>
> Would it be possible to give it a try on your hardware?

Hello Alexander,

It is your task to measure the performance impact of these patches and
not Keith's task. BTW, I'm not convinced that multiple hardware queues
per CPU will result in a performance improvement. I have not yet seen
any SSD for which a queue depth above 512 results in better performance
than a queue depth equal to 512. Which applications do you think will
generate and sustain a queue depth above 512? Additionally, my
experience from another high performance context (RDMA) is that reducing
the number of queues can result in higher IOPS due to fewer interrupts
per I/O.

Bart.


* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-19 10:38   ` [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues Alexander Gordeev
  2016-09-19 13:33     ` Bart Van Assche
@ 2016-09-20 15:00     ` Keith Busch
  1 sibling, 0 replies; 3+ messages in thread
From: Keith Busch @ 2016-09-20 15:00 UTC (permalink / raw)
  To: Alexander Gordeev; +Cc: linux-kernel, Jens Axboe, linux-nvme, linux-block

On Mon, Sep 19, 2016 at 12:38:05PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
> 
> > Having a 1:1 already seemed like the ideal solution since you can't
> > simultaneously utilize more than that from the host, so there's no more
> > h/w parallelism we can exploit. On the controller side, fetching
> > commands is serialized memory reads, so I don't think spreading IO
> > among more h/w queues helps the target over posting more commands to a
> > single queue.
> 
> I take note of the unordered command completion you describe below.
> But I fail to realize why a CPU would not simultaneously utilize more
> than one queue by posting to multiple. Is it due to NVMe specifics or
> do you assume the host would not issue that many commands?

What I mean is that if you have N CPUs, you can't possibly simultaneously
write more than N submission queue entries. The benefit of having 1:1
for the queue <-> CPU mapping is that each CPU can post a command to
its queue without lock contention at the same time as another thread.
Having more to choose from doesn't let the host post commands any faster
than we can today.
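
As a toy illustration of that point (NR_CPUS, struct sq and
post_command below are assumptions for the example, not blk-mq or
driver code): with one submission queue per CPU, posting a command only
touches that CPU's own queue tail, so there is never a shared lock to
contend on the submission path.

    #include <stdio.h>

    #define NR_CPUS  4
    #define SQ_DEPTH 1024

    struct sq {
        unsigned int tail;   /* next free submission queue entry */
    };

    static struct sq per_cpu_sq[NR_CPUS];   /* one submission queue per CPU */

    /* Post one command from 'cpu': only that CPU's private queue is touched. */
    static unsigned int post_command(unsigned int cpu)
    {
        struct sq *q = &per_cpu_sq[cpu];
        unsigned int slot = q->tail;

        /* No lock needed: no other CPU ever writes this queue's tail. */
        q->tail = (q->tail + 1) % SQ_DEPTH;
        return slot;
    }

    int main(void)
    {
        for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
            printf("CPU %u posted to slot %u of its private SQ\n",
                   cpu, post_command(cpu));
        return 0;
    }

With N CPUs there can be at most N such tail updates happening at any
instant, which is why queues beyond 1:1 add no submission-side
parallelism.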

When we're out of tags, the request currently just waits for one to
become available, increasing submission latency. You can fix that by
increasing the available tags with deeper or more h/w queues, but that
just increases completion latency since the device can't process them
any faster. It's six of one, half dozen of the other.
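
A back-of-the-envelope illustration of that trade-off via Little's law
(average completion latency is roughly outstanding commands divided by
device throughput); the 1M IOPS figure is an assumed device throughput
for the sake of the example, not a measurement of any particular device.

    #include <stdio.h>

    int main(void)
    {
        const double device_iops = 1000000.0;   /* assumed saturated throughput */
        const unsigned int tags[] = { 1024, 2048, 4096 };

        for (unsigned int i = 0; i < sizeof(tags) / sizeof(tags[0]); i++) {
            /* Little's law: latency ~= outstanding / throughput */
            double latency_ms = tags[i] / device_iops * 1000.0;
            printf("%u outstanding commands -> ~%.1f ms average completion latency\n",
                   tags[i], latency_ms);
        }
        return 0;
    }

Once the device is saturated, doubling the number of tags simply
doubles the average completion latency, which is the point made above.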

The depth per queue defaults to 1k. If your process really is able to use
all those resources, the hardware is completely saturated and you're not
going to benefit from introducing more tags [1]. It could conceivably be
worse: more tags can reduce cache hits, and the increased completion
latency can trip inappropriate timeout handling.

 [1] http://lists.infradead.org/pipermail/linux-nvme/2014-July/001064.html


Thread overview: 3+ messages
     [not found] <cover.1474014910.git.agordeev@redhat.com>
     [not found] ` <20160916210448.GA1178@localhost.localdomain>
2016-09-19 10:38   ` [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues Alexander Gordeev
2016-09-19 13:33     ` Bart Van Assche
2016-09-20 15:00     ` Keith Busch
