From: Jamie Lokier <lk@tantalophile.demon.co.uk>
To: Davide Libenzi <davidel@xmailserver.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-aio@kvack.org, lse-tech@lists.sourceforge.net,
Linus Torvalds <torvalds@transmeta.com>,
Andrew Morton <akpm@digeo.com>,
Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)
Date: Fri, 1 Nov 2002 02:01:19 +0000
Message-ID: <20021101020119.GC30865@bjl1.asuk.net>
In-Reply-To: <Pine.LNX.4.44.0210311642300.1562-100000@blue1.dev.mcafeelabs.com>
Davide Libenzi wrote:
> Jamie, the futex support can be easily done with one line of code patch. I
> still prefer the one-to-one mapping between futexes and files. It makes
> everything easier.
I do agree it is very simple and hence good.
> I don't really see futex creation/destroy as a high-frequency event
> that might be suitable for optimization. Usually you have your own
> set of resources to be "protected" and in 95% of cases you know
> those resources from the beginning.
Well, I'll disagree but stay mostly quiet. I think it is reasonable
to have a futex per _object_ in certain language run-times.
Allocation rate: 10,000,000 per second in some examples (e.g. certain
kinds of threaded simulator).
Hardly any of those will need associated fds, and I have no figures on
how many or how often, but you can see that futexes are sometimes used
in a very dynamic way because they are so cheap until contention.
That's the cool thing about futexes: there's absolutely zero kernel
overhead until contention, and only one "long" of overhead in user
space.
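
A minimal sketch of that uncontended fast path, assuming the well-known
three-state futex mutex design rather than any code from this thread.
The names sketch_mutex, sketch_lock, sketch_unlock, futex_call and the
kernel_entries counter are hypothetical and exist only for illustration.
Current Linux futex words are 32-bit integers, which the sketch asserts.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical type: 0 = free, 1 = locked, 2 = locked with waiters. */
typedef struct { atomic_int word; } sketch_mutex;

static_assert(sizeof(atomic_int) == 4, "futex word must be 32 bits");

/* Illustration only: counts how often we enter the kernel. */
static long kernel_entries;

static long futex_call(atomic_int *addr, int op, int val)
{
    kernel_entries++;
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(sketch_mutex *m)
{
    int c = 0;

    /* Fast path: one atomic operation in user space, no syscall. */
    if (atomic_compare_exchange_strong(&m->word, &c, 1))
        return;

    /* Slow path: mark the lock contended and sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->word, 2);
    while (c != 0) {
        futex_call(&m->word, FUTEX_WAIT, 2);
        c = atomic_exchange(&m->word, 2);
    }
}

static void sketch_unlock(sketch_mutex *m)
{
    /* Only a contended lock (state 2) needs a wake syscall. */
    if (atomic_fetch_sub(&m->word, 1) != 1) {
        atomic_store(&m->word, 0);
        futex_call(&m->word, FUTEX_WAKE, 1);
    }
}
```

With a single thread, a sketch_lock()/sketch_unlock() pair leaves
kernel_entries at zero: the only per-object cost is the one word.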
At contention, two syscalls resolve it synchronously: futex_wait,
futex_wake. The async method using an fd with epoll takes five:
futex_fd, epoll_ctl, poll, futex_wake, futex_close. That works, but
lacks the _cool_ factor that futexes have IMHO. It should be:
futex_wait_async, futex_wake.
I realise my argument is a weak one though :)
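
To make the five-call sequence above concrete, here is a hedged sketch
that uses eventfd as a stand-in for a futex fd, since FUTEX_FD is not
available in current kernels. This is a swapped-in mechanism, not the
futex_fd interface discussed in this thread; async_wait_via_fd is a
hypothetical name. Because the sketch is single-threaded, the wake is
issued before the wait so that the wait does not block.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 0 if the event was delivered through epoll, -1 otherwise. */
static int async_wait_via_fd(void)
{
    struct epoll_event ev = { .events = EPOLLIN }, got;
    uint64_t one = 1;
    int rc = -1;

    /* One-off setup shared by every fd, so not counted among the five. */
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    /* 1: create the fd (stands in for futex_fd). */
    int fd = eventfd(0, EFD_NONBLOCK);
    if (fd < 0) {
        close(ep);
        return -1;
    }
    ev.data.fd = fd;

    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0 &&          /* 2: register  */
        write(fd, &one, sizeof one) == (ssize_t)sizeof one &&  /* 4: the "wake" */
        epoll_wait(ep, &got, 1, 1000) == 1 &&                  /* 3: the poll   */
        got.data.fd == fd)
        rc = 0;

    close(fd);                                                 /* 5: tear down  */
    close(ep);
    return rc;
}
```

Counting the numbered comments gives the five kernel entries per wait,
against two for the synchronous futex_wait/futex_wake pair.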
> > > Timers, as long as you access them through a file* interface ( like futexes )
> > > will become trivial too. Another line should be sufficient for dnotify :
> >
> > Sorry (<humble/>), ignore timers. Somehow I picked up the idea that
> > epoll_wait() didn't have a timeout from some example or other, which
> > was very silly of me. I've read the patch properly now! Of course
> > epoll supports timers - a timeout is quite enough for user space.
>
> If you want to timeout I/O operations you can easily put a timer routine
> in your main event scheduler loop. But I still like the idea of timers
> easily accessible through a file* interface.
Sure, but using a file * interface implies entering the kernel - that
can sometimes be skipped* if your timer queue is in user space.
* - it happens under heavy load, conveniently.
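
A rough sketch of a user-space timer queue feeding the epoll_wait()
timeout, under the assumption that deadlines are absolute millisecond
values kept in a plain array. The helper next_timeout_ms and its
parameters are hypothetical. Arming or cancelling a timer only touches
the array, so no kernel entry is needed for it.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the timeout to hand to epoll_wait().
 *   -1  no timers pending, so block until an fd event arrives
 *    0  a timer has already expired, so do not sleep at all
 *   >0  milliseconds until the earliest deadline
 */
static int next_timeout_ms(const uint64_t *deadlines, size_t n,
                           uint64_t now_ms)
{
    if (n == 0)
        return -1;

    /* A linear scan keeps the sketch short; a heap would scale better. */
    uint64_t earliest = deadlines[0];
    for (size_t i = 1; i < n; i++)
        if (deadlines[i] < earliest)
            earliest = deadlines[i];

    if (earliest <= now_ms)
        return 0;

    uint64_t delta = earliest - now_ms;
    return delta > INT_MAX ? INT_MAX : (int)delta;
}
```

Under heavy load the loop is usually still handling fd events when a
deadline passes, so the helper returns 0 and the expired timer is run
without the process ever sleeping on it.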
> > > void __inode_dir_notify(struct inode *inode, unsigned long event)
> >
> > Agreed. This is looking good :)
>
> I asked Linus what he thinks about this one-line patch.
I have no objections to it. Generally, I'd like epoll to be able to
report _what_ the event was (not just POLL_RDNORM, but what kind of
dnotify event), but as I don't get to run on an ideal kernel [;)] I'll
be happy with POLL_RDNORM.
> I still believe that the 1:1 mapping is sufficient and with that in place (
> and the one line patch to kernel/futex.c ) futex support comes nicely.
It does work - actually, with ->poll() you don't need any lines in futex.c :)
Even if a specialised futex hook is added someday, the fd support will
continue to be useful.
> > 2. Add a check to EP_CTL_ADD which checks whether a file supports
> > epoll notifications natively. Perhaps a file_operations hook
> > is in order here. If it does, great. If not, fall back to
> > a generic mechanism that uses the file's ->poll() method. (I
> > haven't thought through for sure how plausible this is).
> > Magically, every kind of fd works, including special devices,
> > and the things that are most performance critical (sockets,
> > pipes, futexes) are tuned. Yum!
>
> Yes, kind of. The hook for an efficient edge triggered event notification
> should be something like the socket one where you have a ->data_ready()
> and ->write_space(), where the caller of these callbacks knows that signals
> have to be delivered on 0->1 transitions. With the poll hook you have the
> drawback that the wakeup list is invoked each time data arrives and this
> might generate a little bit too many events. This is not a problem since
> epoll collapses them, but still collapsing does cost CPU cycles.
You avoid the extra CPU cycles like this:
1. EP_CTL_ADD adds the listener to the file's wait queue using
->poll(), and gets a free test of the object readiness [;)]
2. When the transition happens, the wakeup will call your function,
epoll_wakeup_function. That removes the listener from the file's
wait queue. Note, you won't see any more wakeups from that file.
3. When you report the event to user space, _then_ you automatically
add the listener back to the file's wait queue by calling ->poll().
This way, there are no spurious wakeups, and nothing to collapse. I
would not be surprised if this is quite fast - perhaps as fast as the
special epoll hooks.
The nice feature that makes this possible is that waitqueues don't
wake up tasks any more: they simply call your choice of callback
function. It was changed for aio, and it's a good change.
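
The three steps above can be modelled in user space. What follows is a
toy model only, not kernel code: struct toy_file, struct listener and
the toy_* functions are all hypothetical, the "wait queue" holds a
single entry, and locking and races are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_file;

/* Hypothetical listener: one epoll interest in one file. */
struct listener {
    void (*on_wake)(struct listener *);  /* callback run by the queue */
    struct toy_file *file;
    bool ready;                          /* event pending for user space */
    int callbacks;                       /* how often on_wake has run */
};

/* Hypothetical file with a one-slot wait queue. */
struct toy_file {
    struct listener *waiter;
    bool readable;
};

/* Steps 1 and 3: ->poll() queues the listener and tests readiness. */
static bool toy_poll(struct toy_file *f, struct listener *l)
{
    f->waiter = l;
    return f->readable;
}

/* Step 2: the wakeup callback records the event and dequeues itself. */
static void toy_epoll_wakeup(struct listener *l)
{
    l->callbacks++;
    l->ready = true;
    l->file->waiter = NULL;   /* no more wakeups from this file */
}

/* The file's side: data arrives, so call whoever is queued. */
static void toy_data_arrives(struct toy_file *f)
{
    struct listener *l = f->waiter;

    f->readable = true;
    if (l)
        l->on_wake(l);
}

/* Step 3: report to user space, then re-arm through ->poll(). */
static bool toy_report(struct toy_file *f, struct listener *l)
{
    if (!l->ready)
        return false;
    l->ready = false;
    f->readable = false;      /* pretend user space consumed the data */
    toy_poll(f, l);
    return true;
}
```

In this model, three calls to toy_data_arrives() in a row run the
callback once; the next callback can only happen after toy_report()
has put the listener back on the queue.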
-- Jamie