Linux io-uring development
From: Pavel Begunkov <asml.silence@gmail.com>
To: Bernd Schubert <bernd.schubert@fastmail.fm>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org, Joanne Koong <joannelkoong@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>
Subject: Re: Large CQE for fuse headers
Date: Mon, 14 Oct 2024 18:48:13 +0100
Message-ID: <74b0e140-f79d-4a89-a83a-77334f739c92@gmail.com>
In-Reply-To: <24ee0d07-47cc-4dcb-bdca-2123f38d7219@fastmail.fm>

On 10/14/24 16:21, Bernd Schubert wrote:
> On 10/14/24 15:34, Pavel Begunkov wrote:
>> On 10/14/24 13:47, Bernd Schubert wrote:
>>> On 10/14/24 13:10, Miklos Szeredi wrote:
>>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <tom.leiming@gmail.com> wrote:
>>>>
>>>>> It also depends on how fuse user code consumes the big CQE payload, if
>>>>> fuse header needs to keep in memory a bit long, you may have to copy it
>>>>> somewhere for post-processing since io_uring(kernel) needs CQE to be
>>>>> returned back asap.
>>>>
>>>> Yes.
>>>>
>>>> I'm not quite sure how the libfuse interface will work to accommodate
>>>> this.  Currently if the server needs to delay the processing of a
>>>> request it would have to copy all arguments, since validity will not
>>>> be guaranteed after the callback returns.  With the io_uring
>>>> infrastructure the headers would need to be copied, but the data
>>>> buffer would be per-request and would not need copying.  This is
>>>> relaxing a requirement so existing servers would continue to work
>>>> fine, but would not be able to take full advantage of the multi-buffer
>>>> design.
>>>>
>>>> Bernd do you have an idea how this would work?
>>>
>>> I assume returning a CQE is io_uring_cq_advance()?
>>
>> Yes
>>
>>> In my current libfuse io_uring branch that only happens when
>>> all CQEs have been processed. We could also easily switch to
>>> io_uring_cqe_seen() to do it per CQE.
>>
>> Either that one.
>>
>>> I don't understand why we need to return CQEs asap, assuming CQ
>>> ring size is the same as SQ ring size - why does it matter?
>>
>> The SQE is consumed once the request is issued, but nothing
>> prevents the user to keep the QD larger than the SQ size,
>> e.g. do M syscalls each ending N requests and then wait for

Typo: that should read "each sending (or queueing) N requests".
In other words, it's perfectly legal to:

ring = create_ring(nr_cqes=N);
for (i = 0..M) {
	for (j = 0..N)
		prep_sqe();
	submit_all_sqes();
}
wait(nr=N * M);


With the caveat that a single wait can't return more completions
than the CQ size, but you can wrap the wait in a loop:

while (nr_inflight_cqes) {
	wait(nr = min(CQ_size, nr_inflight_cqes));
	process_cqes();
}

Or do something more elaborate. Frameworks often let you push
any number of requests without caring much about matching
queue sizes exactly, apart from sizing them for performance
reasons.
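
To make the accounting in the pseudocode above concrete, here is a minimal
sketch in plain C. It is a toy model and not liburing: struct ring_model and
the model_*() helpers are hypothetical names made up for illustration, and the
model assumes no request completes until the drain loop runs. It only tracks
counts, showing that M submit calls of N requests each leave N * M requests in
flight even though the SQ holds just N entries, and that draining them takes
several waits when the CQ also holds N.

```c
#include <assert.h>

/* Toy accounting model (hypothetical, not liburing): tracks counts only. */
struct ring_model {
	unsigned sq_entries;   /* max SQEs that can be queued per submit */
	unsigned cq_entries;   /* max CQEs a single wait can hand back */
	unsigned queued;       /* SQEs prepared but not yet submitted */
	unsigned inflight;     /* submitted, completion not yet reaped */
};

/* Queue one request; returns 0 when the SQ is full, 1 otherwise. */
int model_prep_sqe(struct ring_model *r)
{
	if (r->queued == r->sq_entries)
		return 0;
	r->queued++;
	return 1;
}

/* Submitting consumes the queued SQEs, so the SQ is empty again afterwards. */
unsigned model_submit(struct ring_model *r)
{
	unsigned n = r->queued;

	r->inflight += n;
	r->queued = 0;
	return n;
}

/* Reap in chunks of at most cq_entries; returns the number of waits needed. */
unsigned model_drain(struct ring_model *r)
{
	unsigned waits = 0;

	while (r->inflight) {
		unsigned nr = r->inflight < r->cq_entries ?
			      r->inflight : r->cq_entries;

		r->inflight -= nr;   /* stands in for wait(nr) + process_cqes() */
		waits++;
	}
	return waits;
}

/* M submit calls of N requests each, then drain; reports peak in-flight. */
unsigned model_run(unsigned n, unsigned m, unsigned *peak_inflight)
{
	struct ring_model r = { .sq_entries = n, .cq_entries = n };

	for (unsigned i = 0; i < m; i++) {
		for (unsigned j = 0; j < n; j++) {
			int ok = model_prep_sqe(&r);

			assert(ok);   /* the SQ never fills: it is emptied per submit */
			(void)ok;
		}
		model_submit(&r);
	}
	*peak_inflight = r.inflight;
	return model_drain(&r);
}
```

With N = 4 and M = 3 the model reports 12 requests in flight and needs 3 waits
to reap them. A real ring differs in that completions can be posted before the
application waits, so this covers the counting and nothing more.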

>> N * M completions.
>>
> 
> I need a bit help to understand this. Do you mean that in typical
> io-uring usage SQEs get submitted, already released in kernel

Typical or not, the number of requests in flight is not
limited by the size of the SQ. The SQ size only limits how many
requests you can queue per syscall, i.e. per io_uring_submit().


> and then users submit even more SQEs? And that creates a
> kernel queue depth for completion?
> I guess as long as libfuse does not expose the ring we don't have
> that issue. But then yeah, exposing the ring to fuse-server/daemon
> is planned...

Could be. For example, you don't need to care about overflows
at all if the CQ size is always larger than the number of
requests in flight. Perhaps the simplest example:

prep_requests(nr=N);
wait_cq(nr=N);
process_cqes(nr=N);
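
As a rough illustration of that sizing rule, the sketch below picks a CQ size
from the maximum number of requests an application keeps in flight. The helper
names (round_up_pow2, cq_entries_for, cq_may_overflow) are hypothetical. The
rounding reflects my understanding that io_uring rounds requested ring sizes up
to a power of two; treat that as an assumption and check it against the kernel
you target.

```c
#include <assert.h>

/* Round up to the next power of two (assumed to mirror ring-size rounding). */
unsigned round_up_pow2(unsigned v)
{
	unsigned p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Smallest power-of-two CQ size that can hold every in-flight completion. */
unsigned cq_entries_for(unsigned max_inflight)
{
	/* Upper bound keeps the shift in round_up_pow2() from wrapping to 0. */
	assert(max_inflight > 0 && max_inflight <= (1u << 31));
	return round_up_pow2(max_inflight);
}

/* Nonzero if completions could exceed the CQ before the app reaps them. */
int cq_may_overflow(unsigned cq_entries, unsigned max_inflight)
{
	return max_inflight > cq_entries;
}
```

For instance, an application that caps itself at 100 requests in flight would
ask for a CQ of 128 entries under this sketch, while a CQ of 4 entries with 12
requests in flight may overflow.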

>>> If we indeed need to return the CQE before processing the request,
>>> it indeed would be better to have a 2nd memory buffer associated with
>>> the fuse request.
>>
>> With that said, the usual problem is to size the CQ so that it
>> (almost) never overflows, otherwise it hurts performance. With
>> DEFER_TASKRUN you can delay returning CQEs to the kernel until
>> the next time you wait for completions, i.e. do io_uring waiting
>> syscall. Without the flag, CQEs may come asynchronously to the
>> user, so need a bit more consideration.
>>
> 
> Current libfuse code has it disabled IORING_SETUP_SINGLE_ISSUER,
> IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and
> IORING_SETUP_COOP_TASKRUN as these are somehow slowing down
> things.

Those flags are not a requirement. You can try to size the
CQ so that overflows are rare; it's just a bit easier to do
with DEFER_TASKRUN.

> Not sure if this thread is optimal to discuss this. I would
> also first like to sort out all the other design topics before
> going into fine-tuning...

-- 
Pavel Begunkov
