From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: Bug: epoll_wait timeout is shorter than requested
Date: Mon, 17 Jan 2005 16:48:36 +0000
Message-ID: <20050117164836.GA25815@mail.shareable.org>
References: <87651wl32d.fsf@qrnik.zagroda> <20050117114821.GB20152@mail.shareable.org> <87r7kk41gp.fsf@qrnik.zagroda> <20050117143348.GA23427@mail.shareable.org> <87acr8uizp.fsf@qrnik.zagroda>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail.shareable.org ([81.29.64.88]:3513 "EHLO mail.shareable.org")
	by vger.kernel.org with ESMTP id S262258AbVAQQsh (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Mon, 17 Jan 2005 11:48:37 -0500
Received: from mail.shareable.org (localhost [127.0.0.1])
	by mail.shareable.org (8.12.8/8.12.8) with ESMTP id j0HGma81026178
	for <linux-fsdevel@vger.kernel.org>; Mon, 17 Jan 2005 16:48:36 GMT
Received: (from jamie@localhost)
	by mail.shareable.org (8.12.8/8.12.8/Submit) id j0HGmaMK026176
	for linux-fsdevel@vger.kernel.org; Mon, 17 Jan 2005 16:48:36 GMT
To: linux-fsdevel@vger.kernel.org
Content-Disposition: inline
In-Reply-To: <87acr8uizp.fsf@qrnik.zagroda>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Marcin 'Qrczak' Kowalczyk wrote:
> > If you call select with { 0, 10000 } - that is, 10 milliseconds, then
> > you get a delay between 0ms and 10ms on a 100Hz kernel.
> >
> > That is easy to measure.  Just call select() in a loop and observe the
> > times.
> 
> I think HZ is now 1000 on x86, so I can't determine experimentally
> which 1ms shifts come from the resolution of poll/epoll interface and
> which come from the timer frequency.

Sure you can.  Call select(0,0,0,0,{0,1000}) in a loop, and you'll
see it returns every 1ms.

Call poll(0,0,1) and you'll see it returns every 2ms.

Call epoll_wait (epfd,0,0,1) and you'll see it returns every 1ms.

That 2ms minimum wait with poll() is the problem, although it was more
of a problem when HZ was 100 (as it still is on some architectures).

> What you are saying implies that the amount by which select/epoll may
> shorten the timeout depends on the timer frequency. Thus a program
> which does not know the timer frequency can't know how much to make
> the timeout longer, without risking having to sleep again.

Correct.

Another way to look at it is that a program which doesn't know the
timer frequency can't know how much to make the poll() timeout shorter
if it wants shortest non-zero timeouts, or if it is trying to time an
event as close as possible to an absolute time.

In my experience with a simple interactive X video game, the only
portable way to do this is to actually measure the times at which
select/poll return and deduce the OS's granularity and late/early
rounding using some kind of control system estimator.  (Then if you
need finer granularity (as the game did) you need a busy loop to add
the remaining sub-tick time).

> I guess the 1ms here is actually the timer tick and that in case of
> epoll rules are the same, except that the timeout is specified in ms.
> That is, it is rounded up to a multiple of timer ticks, and then the
> actual timeout is between 0 and 1 tick shorter, such that it ends at
> some tick. Right?

Yes.  Look at the kernel code: the only difference between epoll and
poll is that poll has "+1" at the end of the equation for the number
of ticks to wait.

> This means that depending on the fraction of the current tick which
> has elapsed, and the fraction of the timeout we want to sleep, the
> optimal request may have two possible values. By optimal request I
> mean the one which will give us the shortest delay which is not
> shorter than the one we actually want. We don't know which request
> to give if we don't know the timer frequency.
> 
> For example, assuming 100Hz clock, if we are 3.333ms after a tick
> and we want to sleep at least for 124ms, we should give some timeout
> between 121ms and 130ms. We will actually sleep 126.667ms, which is
> fine. But if we are 6.666ms after a tick is and we want to sleep at
> least for 126ms, we should give some timeout between 131ms and 140ms.
> This will give us an actual delay of 133.334ms - one tick earlier
> would be too short.
> 
> So perhaps my program should indeed do what it currently does.
> Sometimes the actual delay will be too short and a separate epoll call
> will sleep the remaining tick. But if the program always added one
> tick, it would sometimes sleep one tick longer than necessary (and
> another problem would be that it does know the timer frequency,
> so it can't add one tick).
> 
> I think this gives the optimal behavior wrt. the number of ticks to
> sleep, and the only disadvantage is more syscalls in some cases.
> 
> The kernel could make this better because it knows the timer frequency
> and it can determine the fraction of the current tick: it would make
> the delay longer by 1 tick in case the rest of the current tick is
> shorter than the fractional part of the delay (wrt. a tick) reduced by
> the unit of resolution of the interface (to allow for 1 unit to mean
> "until the next tick").
> 
> But it would help only in case the tick is longer than the resolution
> of the epoll interface, so perhaps it's not worth the effort - I think
> today it's usually 1ms, equal to the epoll resolution. With select it
> would help more than with epoll, because the select interface has a
> finer resolution, but OTOH select is old-fashioned. And it would only
> help in saving some syscalls, it would not provide a behavior which
> is unimplementable today.
> 
> > The man page for select says the timeout serves as an upper bound.
> 
> Well, because of other processes a timeout can always become longer
> than requested. It should be an upper bound in the sense that it will
> return earlier if a fd is ready. But it should not return earlier if
> fds are not ready, pretending that the timeout expired while in fact
> it did not.
> 
> Except that, as you say, it would prevent specifying a timeout
> "until the next tick, even if it's shorter than the resolution
> of the interface"...
> 
> > By the way, select(), poll() and epoll_wait() all have another bug: if
> > the timeout parameter is too large, they'll wait *indefinitely*.  They
> > call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just
> > calls schedule() with no timer.
> 
> Oops. But this should be easy to fix: give MAX_SCHEDULE_TIMEOUT-1.

Yes.  Feel free to submit the patch.

> > If select/poll/epoll were implemented by the kernel reading the
> > current time accurately before deciding how many ticks to wait for,
> > they could satisfy SUSv3's constraint, _and_ allow the useful
> > behaviour of application events at the tick rate, _and_ reduce the
> > number of system calls in some programs which call select().
> 
> Right.
> 
> > If you want to change the code in fs/select.c and fs/eventpoll.c to
> > do this, please do so; I'll be happy to support the case for it.
> 
> I'm still not sure what the behavior should be. It seems poll and
> epoll with their current interfaces can't be made better if the tick
> frequency is 1000Hz...

Correct - but the tick frequency isn't 1000Hz on all architectures and
it isn't likely to change either, because 1000Hz is too fast for
slower CPUs such as in routers and PDAs.

> > By the way, the most logically useful interface would take an
> > *absolute* end time, in any of the forms that the POSIX timer code
> > allows.
> 
> Yes! I actually have an absolute time, and compute a timeout from it.

Nearly every program does.

> Even if a user of my language specifies a relative time, I convert it
> to absolute time first. Then it's converted to relative time in order
> to pass the timeout to epoll/poll/select, and then the kernel probably
> converts it to absolute time again.

Quite.  Silly interface, isn't it? :)  We even get to waste significant
numbers of cycles reading the timer chip every time.  2 microseconds
on your system, ~20 microseconds on some others.

-- Jamie