All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>,
	Artyom Pavlov <newpavlov@gmail.com>,
	io-uring@vger.kernel.org
Subject: Re: Sending CQE to a different ring
Date: Thu, 10 Mar 2022 08:38:31 -0700	[thread overview]
Message-ID: <83ee4c7d-e222-9760-cd4e-5cf2265d54cf@kernel.dk> (raw)
In-Reply-To: <c483c170-9bb7-2f97-744a-267b06b2f142@gmail.com>

On 3/10/22 6:53 AM, Pavel Begunkov wrote:
> On 3/10/22 02:33, Jens Axboe wrote:
>> On 3/9/22 6:55 PM, Jens Axboe wrote:
>>> On 3/9/22 6:36 PM, Jens Axboe wrote:
>>>> On 3/9/22 4:49 PM, Artyom Pavlov wrote:
>>>>> Greetings!
>>>>>
>>>>> A common approach for multi-threaded servers is to have a number of
>>>>> threads equal to a number of cores and launch a separate ring in each
>>>>> one. AFAIK currently if we want to send an event to a different ring,
>>>>> we have to write-lock this ring, create SQE, and update the index
>>>>> ring. Alternatively, we could use some kind of user-space message
>>>>> passing.
>>>>>
>>>>> Such approaches are somewhat inefficient and I think it can be solved
>>>>> elegantly by updating the io_uring_sqe type to allow accepting fd of a
>>>>> ring to which CQE must be sent by kernel. It can be done by
>>>>> introducing an IOSQE_ flag and using one of currently unused padding
>>>>> u64s.
>>>>>
>>>>> Such feature could be useful for load balancing and message passing
>>>>> between threads which would ride on top of io-uring, i.e. you could
>>>>> send NOP with user_data pointing to a message payload.
>>>>
>>>> So what you want is a NOP with 'fd' set to the fd of another ring, and
>>>> that nop posts a CQE on that other ring? I don't think we'd need IOSQE
>>>> flags for that, we just need a NOP that supports that. I see a few ways
>>>> of going about that:
>>>>
>>>> 1) Add a new 'NOP' that takes an fd, and validates that that fd is an
>>>>     io_uring instance. It can then grab the completion lock on that ring
>>>>     and post an empty CQE.
>>>>
>>>> 2) We add a FEAT flag saying NOP supports taking an 'fd' argument, where
>>>>     'fd' is another ring. Posting CQE same as above.
>>>>
>>>> 3) We add a specific opcode for this. Basically the same as #2, but
>>>>     maybe with a more descriptive name than NOP.
>>>>
>>>> Might make sense to pair that with a CQE flag or something like that, as
>>>> there's no specific user_data that could be used as it doesn't match an
>>>> existing SQE that has been issued. IORING_CQE_F_WAKEUP for example.
>>>> Would be applicable to all the above cases.
>>>>
>>>> I kind of like #3 the best. Add a IORING_OP_RING_WAKEUP command, require
>>>> that sqe->fd point to a ring (could even be the ring itself, doesn't
>>>> matter). And add IORING_CQE_F_WAKEUP as a specific flag for that.
>>>
>>> Something like the below, totally untested. The request will complete on
>>> the original ring with either 0, for success, or -EOVERFLOW if the
>>> target ring was already in an overflow state. If the fd specified isn't
>>> an io_uring context, then the request will complete with -EBADFD.
>>>
>>> If you have any way of testing this, please do. I'll write a basic
>>> functionality test for it as well, but not until tomorrow.
>>>
>>> Maybe we want to include in cqe->res who the waker was? We can stuff the
>>> pid/tid in there, for example.
>>
>> Made the pid change, and also wrote a test case for it. Only change
>> otherwise is adding a completion trace event as well. Patch below
>> against for-5.18/io_uring, and attached the test case for liburing.
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2e04f718319d..b21f85a48224 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1105,6 +1105,9 @@ static const struct io_op_def io_op_defs[] = {
>>       [IORING_OP_MKDIRAT] = {},
>>       [IORING_OP_SYMLINKAT] = {},
>>       [IORING_OP_LINKAT] = {},
>> +    [IORING_OP_WAKEUP_RING] = {
>> +        .needs_file        = 1,
>> +    },
>>   };
>>     /* requests with any of those set should undergo io_disarm_next() */
>> @@ -4235,6 +4238,44 @@ static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
>>       return 0;
>>   }
>>   +static int io_wakeup_ring_prep(struct io_kiocb *req,
>> +                   const struct io_uring_sqe *sqe)
>> +{
>> +    if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index || sqe->off ||
>> +             sqe->len || sqe->rw_flags || sqe->splice_fd_in ||
>> +             sqe->buf_index || sqe->personality))
>> +        return -EINVAL;
>> +
>> +    if (req->file->f_op != &io_uring_fops)
>> +        return -EBADFD;
>> +
>> +    return 0;
>> +}
>> +
>> +static int io_wakeup_ring(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +    struct io_uring_cqe *cqe;
>> +    struct io_ring_ctx *ctx;
>> +    int ret = 0;
>> +
>> +    ctx = req->file->private_data;
>> +    spin_lock(&ctx->completion_lock);
>> +    cqe = io_get_cqe(ctx);
>> +    if (cqe) {
>> +        WRITE_ONCE(cqe->user_data, 0);
>> +        WRITE_ONCE(cqe->res, 0);
>> +        WRITE_ONCE(cqe->flags, IORING_CQE_F_WAKEUP);
>> +    } else {
>> +        ret = -EOVERFLOW;
>> +    }
> 
> io_fill_cqe_aux(), maybe? Handles overflows better, increments cq_extra,
> etc. Might also make sense to kick cq_timeouts, so waiters are forced to
> wake up.

I think the main question here is if we want to handle overflows at all,
I deliberately didn't do that. But apart from that io_fill_cqe_aux()
does to everything we need.

I guess the nice thing about actually allocating an overflow entry is
that there's no weird error handling on the submitter side. Let's go
with that.

-- 
Jens Axboe


  reply	other threads:[~2022-03-10 15:38 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-09 23:49 Sending CQE to a different ring Artyom Pavlov
2022-03-10  1:36 ` Jens Axboe
2022-03-10  1:55   ` Jens Axboe
2022-03-10  2:33     ` Jens Axboe
2022-03-10  9:15       ` Chris Panayis
2022-03-10 13:53       ` Pavel Begunkov
2022-03-10 15:38         ` Jens Axboe [this message]
2022-03-10  2:11   ` Artyom Pavlov
2022-03-10  3:00     ` Jens Axboe
2022-03-10  3:48       ` Artyom Pavlov
2022-03-10  4:03         ` Jens Axboe
2022-03-10  4:14           ` Jens Axboe
2022-03-10 14:00             ` Artyom Pavlov
2022-03-10 15:36             ` Artyom Pavlov
2022-03-10 15:43               ` Jens Axboe
2022-03-10 15:46                 ` Jens Axboe
2022-03-10 15:52                   ` Artyom Pavlov
2022-03-10 15:57                     ` Jens Axboe
2022-03-10 16:07                       ` Artyom Pavlov
2022-03-10 16:12                         ` Jens Axboe
2022-03-10 16:22                           ` Artyom Pavlov
2022-03-10 16:25                             ` Jens Axboe
2022-03-10 16:28                               ` Artyom Pavlov
2022-03-10 16:30                                 ` Jens Axboe
2022-03-10 13:34       ` Pavel Begunkov
2022-03-10 13:43         ` Jens Axboe
2022-03-10 13:51           ` Pavel Begunkov
2022-03-10  3:06     ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83ee4c7d-e222-9760-cd4e-5cf2265d54cf@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=asml.silence@gmail.com \
    --cc=io-uring@vger.kernel.org \
    --cc=newpavlov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.