From: Eric Wong <normalperson@yhbt.net>
To: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	peterz@infradead.org, mingo@redhat.com, viro@zeniv.linux.org.uk,
	akpm@linux-foundation.org, davidel@xmailserver.org,
	mtk.manpages@gmail.com, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"luto@amacapital.net >> Andy Lutomirski" <luto@amacapital.net>
Subject: Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
Date: Sun, 22 Feb 2015 00:24:32 +0000
Message-ID: <20150222002432.GA9031@dcvr.yhbt.net>
In-Reply-To: <54E557CF.8080702@akamai.com>

Jason Baron <jbaron@akamai.com> wrote:
> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> >>> [...] However, I think the userspace API change is less 
> >>> clear since epoll_wait() doesn't currently have an 
> >>> 'input' events argument as epoll_ctl() does.
> >> ... but the change would be a bit clearer and somewhat 
> >> more flexible: LIFO or FIFO queueing, right?
> >>
> >> But having the queueing model as part of the epoll 
> >> context is a legitimate approach as well.
> > Btw., there's another optimization that the networking code 
> > already does when processing incoming packets: waking up a 
> > thread on the local CPU, where the wakeup is running.
> >
> > Doing the same on epoll would have real scalability 
> > advantages where incoming events are IRQ driven and are 
> > distributed amongst multiple CPUs.
> >
> > Where events are task driven the scheduler will already try 
> > to pair up waker and wakee so it might not show up in 
> > measurements that markedly.
> >
> 
> Right, so this makes me think that we may want to potentially
> support a variety of wakeup policies. Adding these to the
> generic wake up code is just going to be too messy. So, perhaps
> a better approach here would be to register a single
> wait_queue_t with the event source queue that will always
> be woken up, and then layer any epoll balancing/irq affinity
> policies on top of that. So in essence we end up with sort of
> two queue layers, but I think it provides much nicer isolation
> between layers. Also, the bulk of the changes are going to be
> isolated to the epoll code, and we avoid Andy's concern about
> missing, or starving out wakeups.
> 
> So here's a stab at how this API could look:
> 
> 1. ep1 = epoll_create1(EPOLL_POLICY);
> 
> So EPOLL_POLICY here could be the round robin policy described
> here, or the irq affinity or other ideas. The idea is to create
> an fd that is local to the process, such that other processes
> can not subsequently attach to it and affect our policy.

I'm not against defining more policies if needed.
Maybe FIFO vs LIFO is a good case for this.
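
To make the FIFO vs LIFO distinction concrete, here is a toy
userspace model, not kernel code. The names wake_policy, waitq and
pick_waiter are hypothetical and exist only for this illustration;
the sketch assumes waiters are stored in the order they began waiting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum wake_policy { WAKE_FIFO, WAKE_LIFO };	/* hypothetical names */

struct waitq {
	int waiters[8];		/* waiter ids, oldest first */
	size_t len;
};

/* Remove and return the waiter to wake, or -1 if the queue is empty. */
int pick_waiter(struct waitq *q, enum wake_policy policy)
{
	int id;

	assert(q != NULL);
	if (q->len == 0)
		return -1;

	/* LIFO: the most recent waiter is at the tail. */
	if (policy == WAKE_LIFO)
		return q->waiters[--q->len];

	/* FIFO: take the oldest waiter and shift the rest forward. */
	id = q->waiters[0];
	memmove(&q->waiters[0], &q->waiters[1],
		(q->len - 1) * sizeof(q->waiters[0]));
	q->len--;
	return id;
}
```

FIFO spreads wakeups across all waiters, while LIFO keeps handing
work to the most recently idle one, which is the trade-off a
per-context policy would have to choose between.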

For affinity, it could probably be done transparently based on
epoll_wait retrievals + EPOLL_CTL_MOD operations.

> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
> 
> This associates ep1 with the event source. ep1 can be
> associated with or added to at most 1 wakeup source. This call
> would largely just form the association, but not queue anything
> to the fd_source wait queue.

This would mean one extra FD for every fd_source, but that's
only a handful of FDs (listen sockets), correct?

> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
>     epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
>     epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>      .
>      .
>      .
> 
> Finally, we add the epoll sets to the event source (indirectly via
> ep1). So the first add would actually queue the callback to the
> fd_source. While the subsequent calls would simply queue things
> to the 'nested' wakeup queue associated with ep1.

I'm not sure I follow. Wouldn't this increase the number of wakeups?
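
For comparison, today's epoll already allows one epoll fd to be added
to another, which is the closest existing analogue of steps 2 and 3.
A minimal sketch under these assumptions: an eventfd stands in for
the listen socket, epoll_create1(0) is used because no EPOLL_POLICY
flag exists, and epoll_ctl() gets a real event because the current
API rejects NULL for EPOLL_CTL_ADD. The helper name nested_epoll_demo
is made up for illustration. It shows only the fd layering, not the
proposed wakeup balancing.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the number of events ep2 reports after
 * fd_source becomes readable, or -1 on failure.
 */
int nested_epoll_demo(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event out;
	uint64_t one = 1;
	int ret = -1;

	int fd_source = eventfd(0, EFD_NONBLOCK); /* stand-in event source */
	int ep1 = epoll_create1(0);
	int ep2 = epoll_create1(0);

	if (fd_source < 0 || ep1 < 0 || ep2 < 0)
		goto out;

	/* Step 2 analogue: ep1 watches the event source. */
	ev.data.fd = fd_source;
	if (epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, &ev) < 0)
		goto out;

	/* Step 3 analogue: ep2 watches ep1 instead of fd_source. */
	ev.data.fd = ep1;
	if (epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, &ev) < 0)
		goto out;

	/* Make fd_source readable; readiness propagates via ep1 to ep2. */
	if (write(fd_source, &one, sizeof(one)) != (ssize_t)sizeof(one))
		goto out;

	ret = epoll_wait(ep2, &out, 1, 1000);
	if (ret == 1)
		assert(out.data.fd == ep1);
out:
	if (fd_source >= 0)
		close(fd_source);
	if (ep1 >= 0)
		close(ep1);
	if (ep2 >= 0)
		close(ep2);
	return ret;
}
```

With existing semantics every outer epoll set that contains ep1 sees
it as ready, so choosing a single set to wake would have to come from
the new policy code.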

> So any existing epoll/poll/select calls could be queued as well
> to fd_source and will operate independently from this mechanism,
> as the fd_source queue continues to be 'wake all'. Also, there
> should be no changes necessary to __wake_up_common(), other
> than potentially passing more back through the
> wait_queue_func_t, such as 'nr_exclusive'.
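
As background for the 'wake all' and 'nr_exclusive' remarks, the
following is a simplified userspace model of the loop in
__wake_up_common(), not the kernel source. The model_* names are
hypothetical, a plain array replaces the kernel's linked list, and the
callback always reports success.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EXCLUSIVE 0x01	/* stands in for WQ_FLAG_EXCLUSIVE */

struct model_wait {
	unsigned flags;
	int woken;		/* set when this entry's callback ran */
};

/* Stand-in for a wait_queue_func_t: report a successful wakeup. */
static int model_wake_func(struct model_wait *w)
{
	w->woken = 1;
	return 1;
}

/*
 * Walk the queue in order, waking each entry. Stop once nr_exclusive
 * exclusive entries have been woken; nr_exclusive == 0 never stops,
 * which gives 'wake all'. Returns the number of entries woken.
 */
int model_wake_up_common(struct model_wait *q, size_t n, int nr_exclusive)
{
	int count = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned flags = q[i].flags;

		if (!model_wake_func(&q[i]))
			continue;
		count++;
		if ((flags & MODEL_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
	return count;
}
```

Non-exclusive entries queued ahead of the exclusive ones are always
woken, which is why existing epoll/poll/select waiters would keep
working unchanged next to a single always-woken entry for ep1.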
