From: Jamie Lokier <jamie@shareable.org>
Subject: Re: Bug: epoll_wait timeout is shorter than requested
Date: Mon, 17 Jan 2005 14:33:48 +0000
Message-ID: <20050117143348.GA23427@mail.shareable.org>
References: <87651wl32d.fsf@qrnik.zagroda> <20050117114821.GB20152@mail.shareable.org> <87r7kk41gp.fsf@qrnik.zagroda>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: linux-fsdevel@vger.kernel.org
Content-Disposition: inline
In-Reply-To: <87r7kk41gp.fsf@qrnik.zagroda>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Marcin 'Qrczak' Kowalczyk wrote:
> Is it documented?

Only on linux-kernel some time ago :)

> ftp://ftp.win.tue.nl/pub/home/aeb/linux-local/manpages/man-pages-1.70.tar.gz
> doesn't seem to say that the timeout is interpreted differently for
> poll and epoll.

It doesn't say anything about the difference between poll and select either.

> Will adding 1ms be enough? In other words is epoll supposed to wait
> for some period of time which, when rounded *up* to milliseconds, will
> be >= the requested timeout? As contrasted to poll which waits at
> least the requested timeout - this behaviour is specified by SUSv3.

If you add 1 to the timeout argument to epoll_wait(), you will get the
same behaviour as poll(), because that's what the kernel's poll()
function does internally.
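
A minimal sketch of that workaround follows.  The helper names
padded_timeout_ms() and epoll_wait_at_least() are made up for
illustration; they are not an existing API.  Leaving 0 ("don't block")
and negative ("block forever") timeouts untouched, and not overflowing
INT_MAX, are choices of this sketch:

```c
#include <limits.h>
#include <sys/epoll.h>

/* Hypothetical helper: pad a millisecond timeout by one so that
 * epoll_wait() does not return earlier than requested. */
int padded_timeout_ms(int timeout_ms)
{
    if (timeout_ms <= 0)        /* 0 = don't block, <0 = block forever */
        return timeout_ms;
    if (timeout_ms == INT_MAX)  /* adding 1 would overflow */
        return timeout_ms;
    return timeout_ms + 1;
}

/* Hypothetical wrapper: same arguments as epoll_wait(), padded timeout. */
int epoll_wait_at_least(int epfd, struct epoll_event *events,
                        int maxevents, int timeout_ms)
{
    return epoll_wait(epfd, events, maxevents,
                      padded_timeout_ms(timeout_ms));
}
```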

The behaviour is surely quite specific to Linux, though.  If some
other OS implements epoll, it may not have the same timeout behaviour.

> I can't observe the semantics of the timeout in select because it's in
> microseconds, and a gettimeofday call takes about 2us here. SUSv3 says
> that it should wait at least the requested time (except that if the
> timeout is longer than a maximum supported timeout, which must be at
> least 31 days, then it is allowed to wait shorter). So if select works
> like epoll (can wait up to 1us shorter than the requested timeout),
> it's not conforming to SUSv3.

If you call select with { 0, 10000 } - that is, 10 milliseconds, then
you get a delay between 0ms and 10ms on a 100Hz kernel.

That is easy to measure.  Just call select() in a loop and observe the
times.
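
Something along these lines would do as a rough measurement sketch.
The function names are invented for this example, and it uses
gettimeofday(), so a wall-clock adjustment during the run would skew
the numbers:

```c
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/* Sleep in select() with no fds for the requested time and return how
 * long the call actually took, in microseconds (-1 on error). */
long measure_select_us(long sec, long usec)
{
    struct timeval before, after, tv;

    tv.tv_sec = sec;
    tv.tv_usec = usec;
    if (gettimeofday(&before, NULL) != 0)
        return -1;
    if (select(0, NULL, NULL, NULL, &tv) < 0)
        return -1;
    if (gettimeofday(&after, NULL) != 0)
        return -1;
    return (after.tv_sec - before.tv_sec) * 1000000L
         + (after.tv_usec - before.tv_usec);
}

/* Call select() in a loop and print requested vs. observed delay. */
void print_select_delays(long usec, int iterations)
{
    for (int i = 0; i < iterations; i++)
        printf("requested %ld us, observed %ld us\n",
               usec, measure_select_us(0, usec));
}
```

Calling print_select_delays(10000, 20) and then
print_select_delays(10100, 20) gives the two cases discussed here.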

The man page for select says the timeout serves as an upper bound.

But the man page is wrong too: { 0, 10100 } - that is, 10.1
milliseconds, results in a delay between 10ms and 20ms on a 100Hz
kernel.

By the way, select(), poll() and epoll_wait() all have another bug: if
the timeout parameter is too large, they'll wait *indefinitely*.  They
call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just
calls schedule() with no timer.
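
One hedged way for an application to stay clear of that case is never
to pass a very large timeout at all, and to wait in bounded chunks
instead, re-checking its deadline after each one.  A minimal sketch,
assuming a one-hour cap (CHUNK_CAP_MS and next_wait_ms() are invented
names, and the cap is an arbitrary illustrative value, not the
kernel's actual threshold):

```c
/* Illustrative upper bound for a single wait: one hour in ms.
 * This is an assumption of the sketch, not a kernel constant. */
#define CHUNK_CAP_MS (60 * 60 * 1000)

/* Given the milliseconds remaining until the application's deadline,
 * return the timeout to pass to the next poll()/epoll_wait() call. */
int next_wait_ms(long long remaining_ms)
{
    if (remaining_ms <= 0)
        return 0;                    /* deadline reached: don't block */
    if (remaining_ms > CHUNK_CAP_MS)
        return CHUNK_CAP_MS;         /* wait one chunk, then re-check */
    return (int)remaining_ms;
}
```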

In practice, portable programs which need to time application events
as accurately or with as little jitter as possible (even a simple X
game "snake" needed this for the animation to look smooth) must
measure the typical select/poll lateness / earliness and adapt to the
OS-specific behaviour. :(

> > This isn't just a problem for programs doing low jitter work.  Many
> > programs call select/poll/epoll, and then call gettimeofday()
> > afterwards to decide whether the next "timer" application event is
> > ready to be serviced, or whether to call select/poll/epoll again.
> 
> This is exactly my case. I noticed that it often finishes a little
> before the requested time, and then my program epolls again for 1ms.

I agree this is unwanted.  But the obvious fix to this, which is to
make epoll behave like poll(), prevents a useful behaviour when you
want the finest available timer resolution, i.e. waiting until the
next tick.  It is silly to be forced to use select() for that case,
especially when you might be waiting for fds at the same time.

> > With the poll() behaviour, if a previous poll() finished _just_
> > before the timer event is ready, the application will call poll()
> > again with timeout 1, and then it will wait 10-20ms (on a 100 Hz
> > kernel) instead of the far more desirable 0-10ms.
> 
> Well, if the kernel measured the delay more accurately than to a clock
> tick, it could notice that a requested 1ms would be satisfied by, say,
> 8ms which remained from the current tick.

I agree 100%!  That's a good solution.

If select/poll/epoll were implemented by the kernel reading the
current time accurately before deciding how many ticks to wait for,
they could satisfy SUSv3's constraint, _and_ allow the useful
behaviour of application events at the tick rate, _and_ reduce the
number of system calls in some programs which call select().
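
To make the arithmetic concrete, here is a rough user-space sketch.
It is not kernel code: the function names, and the idea of passing in
the microseconds already elapsed in the current tick, are assumptions
for illustration.  It compares the round-up-and-add-one tick count
that poll() is described as using in this thread with a count based
on the current position within the tick:

```c
/* Tick count in the style attributed to poll() in this thread:
 * round the timeout up to whole ticks, then add one more tick. */
long ticks_poll_style(long timeout_us, long tick_us)
{
    return (timeout_us + tick_us - 1) / tick_us + 1;
}

/* Tick count if the current time is known accurately: into_tick_us is
 * how far the current tick has already progressed.  The N-th tick
 * boundary from now lies N * tick_us - into_tick_us in the future,
 * so pick the smallest N for which that is >= timeout_us. */
long ticks_time_aware(long timeout_us, long tick_us, long into_tick_us)
{
    return (timeout_us + into_tick_us + tick_us - 1) / tick_us;
}
```

With a 100Hz tick (10000us), a 1ms request made 2ms into the tick
needs 2 ticks the poll() way but only 1 tick the time-aware way, and
that single tick still expires 8ms later, i.e. not before the
requested 1ms.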

If you want to change the code in fs/select.c and fs/eventpoll.c to do
this, please do so; I'll be happy to support the case for it.

> There is another point where the man page is misleading: it says that
> closing a fd will automatically unregister it from epoll sets. In
> reality it is unregistered only when the underlying file structure is
> released.

Yes, it means the last close.

> While I understand that the current semantics of sharing epoll fd
> across a fork is a consequence of its design, it is inconvenient in
> my case. I have to epoll_create again and reregister all descriptors
> after a fork, in order for the epoll sets in the two processes to be
> independent.

Fortunately, what you are doing is quite rare.  Usually after fork,
one wants to monitor a different set of fds anyway.
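
For the rare case where the same fds are wanted in an independent
set, a minimal sketch of the re-registration step might look like
this.  rebuild_epoll_set() is a hypothetical helper, and it assumes
the application keeps its own array of the fds it registered:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create a fresh epoll fd and register every fd
 * in fds[] with the same event mask.  Returns the new epoll fd, or -1
 * on error (nothing is left open in that case). */
int rebuild_epoll_set(const int *fds, size_t nfds, uint32_t events)
{
    int epfd = epoll_create(nfds > 0 ? (int)nfds : 1);

    if (epfd < 0)
        return -1;
    for (size_t i = 0; i < nfds; i++) {
        struct epoll_event ev = {0};

        ev.events = events;
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}
```

After fork(), the child would close() its inherited epoll fd (which
only drops the child's reference) and call rebuild_epoll_set() with
its own list of fds.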

-- Jamie