Re: [PATCHSET v3 0/5] Add support for epoll min_wait

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stefan Hajnoczi <stefanha@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait
Date: Tue, 8 Nov 2022 15:29:32 -0500	[thread overview]
Message-ID: <Y2q8LDGCT8rh8zbQ@fedora> (raw)
In-Reply-To: <ba524eab-eff7-5fad-06c2-8188cdf881a1@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 5634 bytes --]

On Tue, Nov 08, 2022 at 10:28:37AM -0700, Jens Axboe wrote:
> On 11/8/22 10:24 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> >> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> >>> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >>>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>>>> Hi Jens,
> >>>>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>>>> mechanisms that are similar.
> >>>>>>
> >>>>>> Yep
> >>>>>>
> >>>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>>>> (counter) value. When a completion occurs, the device waits until the
> >>>>>>> timeout or until the completion counter value is reached.
> >>>>>>>
> >>>>>>> If I've read the code correctly, min_wait is computed at the beginning
> >>>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>>>> completion.
> >>>>>>>
> >>>>>>> It makes me wonder which approach is more useful for applications. With
> >>>>>>> the Aggregation Time approach applications can control how much extra
> >>>>>>> latency is added. What do you think about that approach?
> >>>>>>
> >>>>>> We only tested the current approach, which is time noted from entry, not
> >>>>>> from when the first event arrives. I suspect the nvme approach is better
> >>>>>> suited to the hw side, the epoll timeout helps ensure that we batch
> >>>>>> within xx usec rather than xx usec + whatever the delay until the first
> >>>>>> one arrives. Which is why it's handled that way currently. That gives
> >>>>>> you a fixed batch latency.
> >>>>>
> >>>>> min_wait is fine when the goal is just maximizing throughput without any
> >>>>> latency targets.
> >>>>
> >>>> That's not true at all, I think you're in different time scales than
> >>>> this would be used for.
> >>>>
> >>>>> The min_wait approach makes it hard to set a useful upper bound on
> >>>>> latency because unlucky requests that complete early experience much
> >>>>> more latency than requests that complete later.
> >>>>
> >>>> As mentioned in the cover letter or the main patch, this is most useful
> >>>> for the medium load kind of scenarios. For high load, the min_wait time
> >>>> ends up not mattering because you will hit maxevents first anyway. For
> >>>> the testing that we did, the target was 2-300 usec, and 200 usec was
> >>>> used for the actual test. Depending on what the kind of traffic the
> >>>> server is serving, that's usually not much of a concern. From your
> >>>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> >>>> don't think those would make sense. If your rate of arrival is low
> >>>> enough that min_wait needs to be high to make a difference, then the
> >>>> load is low enough anyway that it doesn't matter. Hence I'd argue that
> >>>> it is indeed NOT hard to set a useful upper bound on latency, because
> >>>> that is very much what min_wait is.
> >>>>
> >>>> I'm happy to argue merits of one approach over another, but keep in mind
> >>>> that this particular approach was not pulled out of thin air AND it has
> >>>> actually been tested and verified successfully on a production workload.
> >>>> This isn't a hypothetical benchmark kind of setup.
> >>>
> >>> Fair enough. I just wanted to make sure the syscall interface that gets
> >>> merged is as useful as possible.
> >>
> >> That is indeed the main discussion as far as I'm concerned - syscall,
> >> ctl, or both? At this point I'm inclined to just push forward with the
> >> ctl addition. A new syscall can always be added, and if we do, then it'd
> >> be nice to make one that will work going forward so we don't have to
> >> keep adding epoll_wait variants...
> > 
> > epoll_wait3() would be consistent with how maxevents and timeout work.
> > It does not suffer from extra ctl syscall overhead when applications
> > need to change min_wait.
> > 
> > The way the current patches add min_wait into epoll_ctl() seems hacky to
> > me. struct epoll_event was meant for file descriptor event entries. It
> > won't necessarily be large enough for future extensions (luckily
> > min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> > ioctl()/setsockopt()-style interface, which is bad for anything that
> > needs to understand syscalls, like seccomp. A properly typed
> > epoll_wait3() seems cleaner to me.
> 
> The ctl method is definitely a bit of an oddball. I've highlighted why
> I went that way in earlier emails, but in summary:
> 
> - Makes it easy to adopt, just adding two lines at init time.
> 
> - Moves detection of availability to init time as well, rather than
>   the fast path.

Add an epoll_create1() flag to test for availability?

> I don't think anyone would want to often change the wait, it's
> something you'd set at init time. If you often want to change values
> for some reason, then obviously a syscall parameter would be a lot
> better.
> 
> epoll_pwait3() would be vastly different than the other ones, simply
> because epoll_pwait2() is already using the maximum number of args.
> We'd need to add an epoll syscall struct at that point, probably
> with flags telling us if signal_struct or timeout is actually valid.

Yes :/.

> This is not to say I don't think we should add a syscall interface,
> just some of the arguments pro and con from having actually looked
> at it.
> 
> -- 
> Jens Axboe
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

next prev parent reply	other threads:[~2022-11-08 20:30 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-30 22:01 [PATCHSET v3 0/5] Add support for epoll min_wait Jens Axboe
2022-10-30 22:01 ` [PATCH 1/6] eventpoll: cleanup branches around sleeping for events Jens Axboe
2022-10-30 22:01 ` [PATCH 2/6] eventpoll: don't pass in 'timed_out' to ep_busy_loop() Jens Axboe
2022-10-30 22:02 ` [PATCH 3/6] eventpoll: split out wait handling Jens Axboe
2022-10-30 22:02 ` [PATCH 4/6] eventpoll: move expires to epoll_wq Jens Axboe
2022-10-30 22:02 ` [PATCH 5/6] eventpoll: move file checking earlier for epoll_ctl() Jens Axboe
2022-10-30 22:02 ` [PATCH 6/6] eventpoll: add support for min-wait Jens Axboe
2022-11-08 22:14   ` Soheil Hassas Yeganeh
2022-11-08 22:20     ` Jens Axboe
2022-11-08 22:25       ` Willem de Bruijn
2022-11-08 22:29         ` Jens Axboe
2022-11-08 22:44           ` Willem de Bruijn
2022-11-08 22:41       ` Soheil Hassas Yeganeh
2022-12-01 18:00       ` Jens Axboe
2022-12-01 18:39         ` Soheil Hassas Yeganeh
2022-12-01 18:41           ` Jens Axboe
2022-11-02 17:46 ` [PATCHSET v3 0/5] Add support for epoll min_wait Willem de Bruijn
2022-11-02 17:54   ` Jens Axboe
2022-11-02 23:09     ` Willem de Bruijn
2022-11-02 23:37       ` Jens Axboe
2022-11-02 23:51         ` Willem de Bruijn
2022-11-02 23:57           ` Jens Axboe
2022-11-05 17:39             ` Jens Axboe
2022-11-05 18:05               ` Willem de Bruijn
2022-11-05 18:46                 ` Jens Axboe
2022-11-07 13:25                   ` Willem de Bruijn
2022-11-07 14:19                     ` Jens Axboe
2022-11-07 10:10               ` David Laight
2022-11-07 20:56 ` Stefan Hajnoczi
2022-11-07 21:38   ` Jens Axboe
2022-11-08 14:00     ` Stefan Hajnoczi
2022-11-08 14:09       ` Jens Axboe
2022-11-08 16:10         ` Stefan Hajnoczi
2022-11-08 16:15           ` Jens Axboe
2022-11-08 17:24             ` Stefan Hajnoczi
2022-11-08 17:28               ` Jens Axboe
2022-11-08 20:29                 ` Stefan Hajnoczi [this message]
2022-11-09 10:09               ` David Laight
2022-11-10 10:13         ` Willem de Bruijn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y2q8LDGCT8rh8zbQ@fedora \
    --to=stefanha@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.