From: Jamie Lokier <lk@tantalophile.demon.co.uk>
To: Davide Libenzi <davidel@xmailserver.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-aio@kvack.org, lse-tech@lists.sourceforge.net,
Linus Torvalds <torvalds@transmeta.com>,
Andrew Morton <akpm@digeo.com>,
Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)
Date: Fri, 1 Nov 2002 02:01:19 +0000
Message-ID: <20021101020119.GC30865@bjl1.asuk.net>
In-Reply-To: <Pine.LNX.4.44.0210311642300.1562-100000@blue1.dev.mcafeelabs.com>
Davide Libenzi wrote:
> Jamie, the futex support can easily be done with a one-line patch. I
> still prefer the one-to-one mapping between futexes and files. It makes
> everything easier.
I do agree it is very simple and hence good.
> I don't really see futex creation/destruction as a high-frequency event
> that might be suitable for optimization. Usually you have your own
> set of resources to be "protected" and in 95% of cases you know
> those resources from the beginning.
Well, I'll disagree but stay mostly quiet. I think it is reasonable
to have a futex per _object_ in certain language run-times.
Allocation rate: 10,000,000 per second in some examples (e.g. certain
kinds of threaded simulator).
Hardly any of those will need associated fds, and I have no figures on
how many or how often, but you can see that futexes are sometimes used
in a very dynamic way because they are so cheap until contention.
That's the cool thing about futexes: there's absolutely zero kernel
overhead until contention, and only one "long" of overhead in user
space.
At contention, two syscalls resolve it synchronously: futex_wait,
futex_wake. The async method using an fd with epoll takes five:
futex_fd, epoll_ctl, poll, futex_wake, futex_close. That works, but
lacks the _cool_ factor that futexes have IMHO. It should be:
futex_wait_async, futex_wake.
I realise my argument is a weak one though :)
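
To make the "zero kernel overhead until contention" point concrete, here
is a minimal sketch of a futex-backed lock. It is an illustration, not
code from this thread: it uses the well-known three-state design (0 =
unlocked, 1 = locked, 2 = locked with possible waiters) and assumes the
FUTEX_WAIT / FUTEX_WAKE operations as exposed through syscall(2) on a
current Linux system. The names demo_lock, demo_unlock and
kernel_entries are hypothetical; the counter exists only to show when
the kernel is actually entered.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical counter: how many times this example entered the kernel. */
static atomic_long kernel_entries;

/* Thin wrapper around the raw futex syscall (glibc provides no wrapper). */
static long demo_futex(atomic_int *addr, int op, int val)
{
    atomic_fetch_add(&kernel_entries, 1);
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Lock word states: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters. */
static void demo_lock(atomic_int *f)
{
    int c = 0;

    assert(f != NULL);
    /* Fast path: one atomic operation in user space, no syscall. */
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;

    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(f, 2);
    while (c != 0) {
        demo_futex(f, FUTEX_WAIT, 2);   /* sleeps only while *f == 2 */
        c = atomic_exchange(f, 2);
    }
}

static void demo_unlock(atomic_int *f)
{
    assert(f != NULL);
    /* Fast path: the word was 1, so nobody is waiting and no syscall. */
    if (atomic_fetch_sub(f, 1) != 1) {
        /* The word was 2: release the lock and wake one waiter. */
        atomic_store(f, 0);
        demo_futex(f, FUTEX_WAKE, 1);
    }
}
```

With a single thread taking and releasing the lock, kernel_entries stays
at zero; the only per-lock cost is the one int in user space. The
syscalls appear only on the two contended paths, which correspond to the
futex_wait / futex_wake pair mentioned above.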
> > > Timers, as long as you access them through a file* interface ( like futexes )
> > > will become trivial too. Another line should be sufficient for dnotify :
> >
> > Sorry (<humble/>), ignore timers. Somehow I picked up the idea that
> > epoll_wait() didn't have a timeout from some example or other, which
> > was very silly of me. I've read the patch properly now! Of course
> > epoll supports timers - a timeout is quite enough for user space.
>
> If you want to timeout I/O operations you can easily put a timer routine
> in your main event scheduler loop. But I still like the idea of timers
> easily accessible through a file* interface.
Sure, but using a file * interface implies entering the kernel - that
can sometimes be skipped* if your timer queue is in user space.
* - it happens under heavy load, conveniently.
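
A minimal sketch of such a user-space timer queue follows. Everything in
it is hypothetical (struct utimer, timer_add and so on are names made up
for illustration), and it uses a small sorted array where a real event
loop would more likely use a heap or timer wheel. The idea it shows: the
nearest deadline becomes the timeout passed to epoll_wait(), and expired
timers are run from the main loop without any timer-specific syscall.
Under heavy load the wait returns immediately with events, so timers are
serviced on the way round the loop.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TIMERS 64

/* Hypothetical user-space timer: absolute deadline plus a callback. */
struct utimer {
    uint64_t deadline_ms;
    void (*fn)(void *arg);
    void *arg;
};

/* Timers kept sorted by deadline, earliest first. */
struct timer_queue {
    struct utimer t[MAX_TIMERS];
    size_t n;
};

/* Insert in deadline order; returns 0 on success, -1 if the queue is full. */
static int timer_add(struct timer_queue *q, uint64_t deadline_ms,
                     void (*fn)(void *), void *arg)
{
    size_t i;

    assert(fn != NULL);
    if (q->n == MAX_TIMERS)
        return -1;
    for (i = q->n; i > 0 && q->t[i - 1].deadline_ms > deadline_ms; i--)
        q->t[i] = q->t[i - 1];
    q->t[i].deadline_ms = deadline_ms;
    q->t[i].fn = fn;
    q->t[i].arg = arg;
    q->n++;
    return 0;
}

/* Timeout for epoll_wait(): -1 = no timers (block), 0 = one is already due. */
static int timer_next_timeout_ms(const struct timer_queue *q, uint64_t now_ms)
{
    uint64_t delta;

    if (q->n == 0)
        return -1;
    if (q->t[0].deadline_ms <= now_ms)
        return 0;
    delta = q->t[0].deadline_ms - now_ms;
    return delta > INT_MAX ? INT_MAX : (int)delta;
}

/*
 * Run every timer whose deadline has passed and return how many fired.
 * Pure user-space work; callbacks must not modify the queue.
 */
static int timer_run_expired(struct timer_queue *q, uint64_t now_ms)
{
    size_t fired = 0, i;

    while (fired < q->n && q->t[fired].deadline_ms <= now_ms) {
        q->t[fired].fn(q->t[fired].arg);
        fired++;
    }
    for (i = fired; i < q->n; i++)
        q->t[i - fired] = q->t[i];
    q->n -= fired;
    return (int)fired;
}

/* Example callback: bump an int counter passed through arg. */
static void count_fire(void *arg)
{
    ++*(int *)arg;
}
```

A main loop would call timer_next_timeout_ms() to pick the timeout for
epoll_wait(), handle whatever events come back, then call
timer_run_expired() with the current time. The caller supplies now_ms;
how cheaply the clock can be read is outside the scope of this sketch.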
> > > void __inode_dir_notify(struct inode *inode, unsigned long event)
> >
> > Agreed. This is looking good :)
>
> I asked Linus what he thinks about this one-line patch.
I have no objections to it. Generally, I'd like epoll to be able to
report _what_ the event was (not just POLL_RDNORM, but what kind of
dnotify event), but as I don't get to run on an ideal kernel [;)] I'll
be happy with POLL_RDNORM.
> I still believe that the 1:1 mapping is sufficient and with that in place (
> and the one line patch to kernel/futex.c ) futex support comes nicely.
It does work - actually, with ->poll() you don't need any lines in futex.c :)
Even if a specialised futex hook is added someday, the fd support will
continue to be useful.
> > 2. Add a check to EP_CTL_ADD which checks whether a file supports
> > epoll notifications natively. Perhaps a file_operations hook
> > is in order here. If it does, great. If not, fall back to
> > a generic mechanism that uses the file's ->poll() method. (I
> > haven't thought through for sure how plausible this is).
> > Magically, every kind of fd works, including special devices,
> > and the things that are most performance critical (sockets,
> > pipes, futexes) are tuned. Yum!
>
> Yes, kind of. The hook for an efficient edge triggered event notification
> should be something like the socket one where you have a ->data_ready()
> and ->write_space(), where the caller of these callbacks knows that signals
> have to be delivered on 0->1 transitions. With the poll hook you have the
> drawback that the wakeup list is invoked each time data arrives and this
> might generate a little bit too many events. This is not a problem since
> epoll collapses them, but still collapsing does cost CPU cycles.
You avoid the extra CPU cycles like this:
  1. EP_CTL_ADD adds the listener to the file's wait queue using
     ->poll(), and gets a free test of the object readiness [;)]
  2. When the transition happens, the wakeup will call your function,
     epoll_wakeup_function. That removes the listener from the file's
     wait queue. Note, you won't see any more wakeups from that file.
  3. When you report the event to user space, _then_ you automatically
     add the listener back to the file's wait queue by calling ->poll().
This way, there are no spurious wakeups, and nothing to collapse. I
would not be surprised if this is quite fast - perhaps as fast as the
special epoll hooks.
The nice feature that makes this possible is that waitqueues don't
wake up tasks any more: they simply call your choice of callback
function. It was changed for aio, and it's a good change.
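
The three-step scheme can be modelled in plain user-space C. This is a
toy simulation under stated assumptions, not kernel code: struct waitq,
struct listener, demo_ctl_add, demo_wakeup_function and demo_report are
all hypothetical stand-ins (demo_wakeup_function plays the role of
epoll_wakeup_function), and the readiness mask that a real ->poll()
returns is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct listener;

/* Hypothetical stand-in for a file's wait queue: a list of callbacks. */
struct waitq {
    struct listener *head;
};

struct listener {
    struct listener *next;
    struct waitq *queue;    /* queue this listener attaches to */
    bool queued;            /* currently on the wait queue? */
    bool ready;             /* event pending, not yet reported */
    int wakeups;            /* how many times the callback ran */
    void (*func)(struct listener *);
};

static void waitq_add(struct listener *l)
{
    if (l->queued)
        return;
    l->next = l->queue->head;
    l->queue->head = l;
    l->queued = true;
}

static void waitq_remove(struct listener *l)
{
    struct listener **pp;

    for (pp = &l->queue->head; *pp; pp = &(*pp)->next) {
        if (*pp == l) {
            *pp = l->next;
            l->next = NULL;
            l->queued = false;
            return;
        }
    }
}

/* A wakeup does not wake a task; it calls each listener's callback. */
static void waitq_wake(struct waitq *q)
{
    struct listener *l = q->head, *next;

    while (l) {
        next = l->next;     /* the callback may unlink l */
        l->func(l);
        l = next;
    }
}

/* Step 2: note the event, then leave the queue so later wakeups skip us. */
static void demo_wakeup_function(struct listener *l)
{
    l->wakeups++;
    l->ready = true;
    waitq_remove(l);
}

/* Step 1: register the listener on the file's wait queue. */
static void demo_ctl_add(struct listener *l, struct waitq *q)
{
    assert(l != NULL && q != NULL);
    l->next = NULL;
    l->queue = q;
    l->queued = false;
    l->ready = false;
    l->wakeups = 0;
    l->func = demo_wakeup_function;
    waitq_add(l);
}

/* Step 3: report a pending event and re-arm; false if nothing to report. */
static bool demo_report(struct listener *l)
{
    if (!l->ready)
        return false;
    l->ready = false;
    waitq_add(l);
    return true;
}
```

In this model, three back-to-back calls to waitq_wake() run the callback
only once, because the listener has already left the queue. Only after
demo_report() hands the event to the caller does the next wakeup reach
it again, so there is nothing to collapse.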
-- Jamie