From: Jens Axboe <axboe@kernel.dk>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait
Date: Wed, 2 Nov 2022 17:57:36 -0600	[thread overview]
Message-ID: <46cb04ca-467c-2e33-f221-3e2a2eaabbda@kernel.dk> (raw)
In-Reply-To: <CA+FuTSdEKsN_47RtW6pOWEnrKkewuDBdsv_qAhR1EyXUr3obrg@mail.gmail.com>

On 11/2/22 5:51 PM, Willem de Bruijn wrote:
> On Wed, Nov 2, 2022 at 7:42 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 11/2/22 5:09 PM, Willem de Bruijn wrote:
>>> On Wed, Nov 2, 2022 at 1:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 11/2/22 11:46 AM, Willem de Bruijn wrote:
>>>>> On Sun, Oct 30, 2022 at 6:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> tldr - we saw a 6-7% CPU reduction with this patch. See patch 6 for
>>>>>> full numbers.
>>>>>>
>>>>>> This adds support for EPOLL_CTL_MIN_WAIT, which allows setting a minimum
>>>>>> time that epoll_wait() should wait for events on a given epoll context.
>>>>>> Some justification and numbers are in patch 6, patches 1-5 are really
>>>>>> just prep patches or cleanups.
>>>>>>
>>>>>> Sending this out to get some input on the API, basically. This is
>>>>>> obviously a per-context type of operation in this patchset, which isn't
>>>>>> necessarily ideal for any use case. Questions to be debated:
>>>>>>
>>>>>> 1) Would we want this to be available through epoll_wait() directly?
>>>>>>    That would allow this to be done on a per-epoll_wait() basis, rather
>>>>>>    than be tied to the specific context.
>>>>>>
>>>>>> 2) If the answer to #1 is yes, would we still want EPOLL_CTL_MIN_WAIT?
>>>>>>
>>>>>> I think there are pros and cons to both, and perhaps the answer to both is
>>>>>> "yes". There are some benefits to doing this at epoll setup time, for
>>>>>> example - it nicely isolates it to that part rather than needing to be
>>>>>> done dynamically every time epoll_wait() is called. This also helps the
>>>>>> application code, as it can turn off any busyness tracking based on
>>>>>> whether the setup accepted EPOLL_CTL_MIN_WAIT or not.
>>>>>>
>>>>>> Anyway, tossing this out there as it yielded quite good results in some
>>>>>> initial testing; we're running more of it. Sending out a v3 now since
>>>>>> someone reported that nonblocking issue, which is annoying. Hoping to
>>>>>> get some more discussion this time around, or at least some...
>>>>>
>>>>> My main question is whether the cycle gains justify the code
>>>>> complexity and runtime cost in all other epoll paths.
>>>>>
>>>>> Syscall overhead is quite dependent on architecture and things like KPTI.
>>>>
>>>> Definitely interested in experiences from other folks, but what other
>>>> runtime costs do you see compared to the baseline?
>>>
>>> Nothing specific. Possible cost from added branches and moving local
>>> variables into structs with possibly cold cachelines.
>>>
>>>>> Indeed, I was also wondering whether an extra timeout arg to
>>>>> epoll_wait would give the same feature with less side effects. Then no
>>>>> need for that new ctrl API.
>>>>
>>>> That was my main question in this posting - what's the best api? The
>>>> current one, epoll_wait() addition, or both? The nice thing about the
>>>> current one is that it's easy to integrate into existing use cases, as
>>>> the decision to do batching on the userspace side or by utilizing this
>>>> feature can be kept in the setup path. If you do epoll_wait() and get
>>>> -1/EINVAL or a false success on older kernels, then that's either a loss
>>>> from thinking it worked, or a need to check for this in the fast path
>>>> every time you call epoll_wait() rather than just at init/setup time.
>>>>
>>>> But this is very much the question I already posed and wanted to
>>>> discuss...
>>>
>>> I see the value in being able to detect whether the feature is present.
>>>
>>> But a pure epoll_wait implementation seems a lot simpler to me, and
>>> more elegant: timeout is an argument to epoll_wait already.
>>>
>>> A new epoll_wait variant would have to be a new system call, so it
>>> would be easy to infer support for the feature.
>>
>> Right, but it'd still mean that you'd need to check this in the fast
>> path in the app vs being able to do it at init time.
> 
> A process could call the new syscall with timeout 0 at init time to
> learn whether the feature is supported.

That is pretty clunky, though... It'd work, but not a very elegant API.

>> Might there be
>> merit to doing both? From the conversion that we tried, the CTL variant
>> definitely made things easier to port. The new syscall would enable
>> per-call delays, however. There might be some merit to that, though I do
>> think that max_events + min_time is how you'd control batching, and
>> that's suitably set in the context itself for most use cases.
> 
> I'm surprised a CTL variant is easier to port. An epoll_pwait3 with an
> extra argument only needs to pass that argument to do_epoll_wait.

It's literally adding two lines of code, that's it. A new syscall is way
worse, both in terms of userspace and kernel-side support across archs
and in terms of changing call sites in the app.

> FWIW, when adding nsec resolution I initially opted for an init-based
> approach, passing a new flag to epoll_create1. Feedback then was that
> it was odd to have one syscall affect the behavior of another. The
> final version just added a new epoll_pwait2 with timespec.

I'm fine with just doing a pure syscall variant too, it was my original
plan. Only changed it to allow for easier experimentation and adoption,
and based on the fact that most use cases would likely use a fixed value
per context anyway.

I think it'd be a shame to drop the ctl, unless there's strong arguments
against it. I'm quite happy to add a syscall variant too, that's not a
big deal and would be a minor addition. Patch 6 should probably cut out
the ctl addition and leave that for a patch 7, and then a patch 8 for
adding a syscall.

-- 
Jens Axboe


Thread overview: 39+ messages
2022-10-30 22:01 [PATCHSET v3 0/5] Add support for epoll min_wait Jens Axboe
2022-10-30 22:01 ` [PATCH 1/6] eventpoll: cleanup branches around sleeping for events Jens Axboe
2022-10-30 22:01 ` [PATCH 2/6] eventpoll: don't pass in 'timed_out' to ep_busy_loop() Jens Axboe
2022-10-30 22:02 ` [PATCH 3/6] eventpoll: split out wait handling Jens Axboe
2022-10-30 22:02 ` [PATCH 4/6] eventpoll: move expires to epoll_wq Jens Axboe
2022-10-30 22:02 ` [PATCH 5/6] eventpoll: move file checking earlier for epoll_ctl() Jens Axboe
2022-10-30 22:02 ` [PATCH 6/6] eventpoll: add support for min-wait Jens Axboe
2022-11-08 22:14   ` Soheil Hassas Yeganeh
2022-11-08 22:20     ` Jens Axboe
2022-11-08 22:25       ` Willem de Bruijn
2022-11-08 22:29         ` Jens Axboe
2022-11-08 22:44           ` Willem de Bruijn
2022-11-08 22:41       ` Soheil Hassas Yeganeh
2022-12-01 18:00       ` Jens Axboe
2022-12-01 18:39         ` Soheil Hassas Yeganeh
2022-12-01 18:41           ` Jens Axboe
2022-11-02 17:46 ` [PATCHSET v3 0/5] Add support for epoll min_wait Willem de Bruijn
2022-11-02 17:54   ` Jens Axboe
2022-11-02 23:09     ` Willem de Bruijn
2022-11-02 23:37       ` Jens Axboe
2022-11-02 23:51         ` Willem de Bruijn
2022-11-02 23:57           ` Jens Axboe [this message]
2022-11-05 17:39             ` Jens Axboe
2022-11-05 18:05               ` Willem de Bruijn
2022-11-05 18:46                 ` Jens Axboe
2022-11-07 13:25                   ` Willem de Bruijn
2022-11-07 14:19                     ` Jens Axboe
2022-11-07 10:10               ` David Laight
2022-11-07 20:56 ` Stefan Hajnoczi
2022-11-07 21:38   ` Jens Axboe
2022-11-08 14:00     ` Stefan Hajnoczi
2022-11-08 14:09       ` Jens Axboe
2022-11-08 16:10         ` Stefan Hajnoczi
2022-11-08 16:15           ` Jens Axboe
2022-11-08 17:24             ` Stefan Hajnoczi
2022-11-08 17:28               ` Jens Axboe
2022-11-08 20:29                 ` Stefan Hajnoczi
2022-11-09 10:09               ` David Laight
2022-11-10 10:13         ` Willem de Bruijn
