Re: [RFC 0/3] nvme uring passthrough diet

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kanchan Joshi <joshi.k@samsung.com>
To: Keith Busch <kbusch@kernel.org>
Cc: Keith Busch <kbusch@meta.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	axboe@kernel.dk, hch@lst.de, xiaoguang.wang@linux.alibaba.com
Subject: Re: [RFC 0/3] nvme uring passthrough diet
Date: Fri, 5 May 2023 13:44:55 +0530	[thread overview]
Message-ID: <20230505081455.GA32732@green245> (raw)
In-Reply-To: <ZFJ7pAuTY6ESCVgp@kbusch-mbp.dhcp.thefacebook.com>

[-- Attachment #1: Type: text/plain, Size: 3246 bytes --]

On Wed, May 03, 2023 at 09:20:04AM -0600, Keith Busch wrote:
>On Wed, May 03, 2023 at 12:57:17PM +0530, Kanchan Joshi wrote:
>> On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
>> > From: Keith Busch <kbusch@kernel.org>
>> >
>> > When you disable all the optional features in your kernel config and
>> > request queue, it looks like the normal request dispatching is just as
>> > fast as any attempts to bypass it. So let's do that instead of
>> > reinventing everything.
>> >
>> > This doesn't require additional queues or user setup. It continues to
>> > work with multiple threads and processes, and relies on the well tested
>> > queueing mechanisms that track timeouts, handle tag exhuastion, and sync
>> > with controller state needed for reset control, hotplug events, and
>> > other error handling.
>>
>> I agree with your point that there are some functional holes in
>> the complete-bypass approach. Yet the work was needed to be done
>> to figure out the gain (of approach) and see whether the effort to fill
>> these holes is worth.
>>
>> On your specific points
>> - requiring additional queues: not a showstopper IMO.
>>  If queues are lying unused with HW, we can reap more performance by
>>  giving those to application. If not, we fall back to the existing path.
>>  No disruption as such.
>
>The current way we're reserving special queues is bad and should
>try to not extend it futher. It applies to the whole module and
>would steal resources from some devices that don't want poll queues.
>If you have a mix of device types in your system, the low end ones
>don't want to split their resources this way.
>
>NVMe has no problem creating new queues on the fly. Queue allocation
>doesn't have to be an initialization thing, but you would need to
>reserve the QID's ahead of time.

Totally in agreement with that. Jens also mentioned this point.
And I had added preallocation in my to-be-killed list. Thanks for
expanding.
Related to that, I think one-qid-per-ring also need to be lifted.
That should allow to do io on two/more devices with the single ring
and see how well that scales.

>> - tag exhaustion: that is not missing, a retry will be made. I actually
>>  wanted to do single command-id management at the io_uring level itself,
>>  and that would have cleaned things up. But it did not fit in
>>  because of submission/completion lifetime differences.
>> - timeout and other bits you mentioned: yes, those need more work.
>>
>> Now with the alternate proposed in this series, I doubt whether similar
>> gains are possible. Happy to be wrong if that happens.
>
>One other thing: the pure-bypass does appear better at low queue
>depths, but utilizing the plug for aggregated sq doorbell writes
>is a real win at higher queue depths from this series. Batching
>submissions at 4 deep is the tipping point on my test box; this
>series outperforms pure bypass at any higher batch count.

I see. 
I hit 5M cliff without plug/batching primarily because pure-bypass
is reducing the code to do the IO. But plug/batching is needed to get
better at this.
If we create space for a pointer in io_uring_cmd, that can get added in
the plug list (in place of struct request). That will be one way to sort
out the plugging.

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

     prev parent reply	other threads:[~2023-05-05  8:18 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20230501154403epcas5p388c607114ad6f9d20dfd3ec958d88947@epcas5p3.samsung.com>
2023-05-01 15:33 ` [RFC 0/3] nvme uring passthrough diet Keith Busch
2023-05-01 15:33   ` [RFC 1/3] nvme: skip block cgroups for passthrough commands Keith Busch
2023-05-03  5:04     ` Christoph Hellwig
2023-05-03 15:25       ` Keith Busch
2023-05-15 15:47         ` Keith Busch
2023-05-01 15:33   ` [RFC 2/3] nvme: fix cdev name leak Keith Busch
2023-05-01 15:33   ` [RFC 3/3] nvme: create special request queue for cdev Keith Busch
2023-05-02 12:20     ` Johannes Thumshirn
2023-05-03  5:04     ` Christoph Hellwig
2023-05-03 14:56       ` Keith Busch
2023-05-01 19:01   ` [RFC 0/3] nvme uring passthrough diet Kanchan Joshi
2023-05-01 19:31     ` Keith Busch
2023-05-03  7:27   ` Kanchan Joshi
2023-05-03 15:20     ` Keith Busch
2023-05-05  8:14       ` Kanchan Joshi [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230505081455.GA32732@green245 \
    --to=joshi.k@samsung.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=kbusch@meta.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.