Bug: epoll_wait timeout is shorter than requested

linux-fsdevel.vger.kernel.org archive mirror

* Bug: epoll_wait timeout is shorter than requested
@ 2005-01-17 11:15 Marcin 'Qrczak' Kowalczyk
  2005-01-17 11:48 ` Jamie Lokier
  0 siblings, 1 reply; 8+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2005-01-17 11:15 UTC (permalink / raw)
  To: linux-fsdevel

A program to exhibit the bug:

----------------
#include <sys/epoll.h>
#include <sys/time.h>
#include <stdio.h>

int main(void) {
   int epoll_fd;
   struct timeval time1, time2;
   struct epoll_event event;
   epoll_fd = epoll_create(16);
   gettimeofday(&time1, NULL);
   epoll_wait(epoll_fd, &event, 1, 1000);
   //poll(NULL, 0, 1000);
   gettimeofday(&time2, NULL);
   /* tv_sec and tv_usec are not int; cast and print them as long */
   printf("start: %ld.%06ld\n", (long)time1.tv_sec, (long)time1.tv_usec);
   printf("stop:  %ld.%06ld\n", (long)time2.tv_sec, (long)time2.tv_usec);
   return 0;
}
----------------

It should wait one second (at least). Example output on Linux-2.6.10:
start: 1105958820.439410
stop:  1105958821.438636

With poll used instead of epoll the timeout is OK:
start: 1105958827.975209
stop:  1105958828.975944

Is this list a good place to report this?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 11:15 Bug: epoll_wait timeout is shorter than requested Marcin 'Qrczak' Kowalczyk
@ 2005-01-17 11:48 ` Jamie Lokier
  2005-01-17 13:41   ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 1 reply; 8+ messages in thread
From: Jamie Lokier @ 2005-01-17 11:48 UTC (permalink / raw)
  To: Marcin 'Qrczak' Kowalczyk; +Cc: linux-fsdevel

Marcin 'Qrczak' Kowalczyk wrote:
> A program to exhibit the bug:

The epoll argument rounds like select(), not like poll().
It was done deliberately.

The epoll_wait() behaviour is deliberate, so that it is possible to
wait repeatedly for short time intervals of 1 timer tick.

> It should wait one second (at least). Example output on Linux-2.6.10:
> start: 1105958820.439410
> stop:  1105958821.438636
> 
> With poll used instead of epoll the timeout is OK:
> start: 1105958827.975209
> stop:  1105958828.975944
> 
> Is this list a good place to report this?

For example, on a system with a 100 Hz timer tick, if you want to write
a program which actually times out on each tick, you can write:

    select (nfds, rfds, wfds, efds, { 0, 10000 });
or
    epoll_wait (epfd, events, maxevents, 10);

That will pause until the next tick, allowing programs which need to
do some work smoothly at 100 Hz.

(If you simply want to track the kernel's tick whatever it is, you can
use {0,1} or 1 as the timeouts respectively).

With poll(), because of the way the timeout argument is rounded up by
a tick in the kernel, that's impossible:

    poll (fds, nfds, -1);   <- Waits for a long time.
    poll (fds, nfds, 0);    <- Does not wait at all.
    poll (fds, nfds, 1);    <- Waits for *second* tick after current one.

This means you can only service application events at 50 Hz on a
system where the kernel tick is 100 Hz, if using poll().

This limitation makes poll() unsuitable on Linux for programs which
need to service events at or close to the kernel's tick rate.

This isn't just a problem for programs doing low jitter work.  Many
programs call select/poll/epoll, and then call gettimeofday() after to
decide whether the next "timer" application event is ready to be
serviced, or whether to call select/poll/epoll again.  With the poll()
behaviour, if a previous poll() finished _just_ before the timer event
is ready, the application will call poll() again with timeout 1, and
then it will wait 10-20ms (on a 100 Hz kernel) instead of the far more
desirable 0-10ms.

For historical reasons, perhaps even accidentally, Linux' select()
call rounds the timeout differently and is suitable for this.  We
decided to not change poll() in case it breaks something, but to make
epoll copy the select() rounding because it's more useful.

Note that all programs which depend on select/poll/epoll waiting for
at least the specified time _should_ check the time afterwards anyway,
for example by calling gettimeofday() and waiting again if the desired
time isn't reached yet.

This is because all three calls will return earlier than expected
under some circumstances, e.g. EINTR.

-- Jamie


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 11:48 ` Jamie Lokier
@ 2005-01-17 13:41   ` Marcin 'Qrczak' Kowalczyk
  2005-01-17 14:33     ` Jamie Lokier
  0 siblings, 1 reply; 8+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2005-01-17 13:41 UTC (permalink / raw)
  To: linux-fsdevel

Jamie Lokier <jamie@shareable.org> writes:

> The epoll argument rounds like select(), not like poll().
> It was done deliberately.

Is it documented?
ftp://ftp.win.tue.nl/pub/home/aeb/linux-local/manpages/man-pages-1.70.tar.gz
doesn't seem to say that the timeout is interpreted differently for
poll and epoll.

Will adding 1ms be enough? In other words is epoll supposed to wait
for some period of time which, when rounded *up* to milliseconds, will
be >= the requested timeout? As contrasted to poll which waits at
least the requested timeout - this behaviour is specified by SUSv3.

I can't observe the semantics of the timeout in select because it's in
microseconds, and a gettimeofday call takes about 2us here. SUSv3 says
that it should wait at least the requested time (except that if the
timeout is longer than a maximum supported timeout, which must be at
least 31 days, then it is allowed to wait shorter). So if select works
like epoll (can wait up to 1us shorter than the requested timeout),
it's not conforming to SUSv3.

> This isn't just a problem for programs doing low jitter work.  Many
> programs call select/poll/epoll, and then call gettimeofday() after to
> decide whether the next "timer" application event is ready to be
> serviced, or whether to call select/poll/epoll again.

This is exactly my case. I noticed that it often finishes a little
before the requested time, and then my program epolls again for 1ms.

> With the poll() behaviour, if a previous poll() finished _just_
> before the timer event is ready, the application will call poll()
> again with timeout 1, and then it will wait 10-20ms (on a 100 Hz
> kernel) instead of the far more desirable 0-10ms.

Well, if the kernel measured the delay more accurately than to a clock
tick, it could notice that a requested 1ms would be satisfied by, say,
8ms which remained from the current tick.

* * *

There is another point where the man page is misleading: it says that
closing a fd will automatically unregister it from epoll sets. In
reality it is unregistered only when the underlying file structure is
released.
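
The point about the underlying file structure can be observed with a
duplicated descriptor. The following is a sketch only; the function name is
made up for illustration. It registers the read end of a pipe, duplicates
it, closes the registered descriptor, and then checks whether epoll still
reports the pipe as readable:

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns the number of events epoll_wait() reports after the registered
 * descriptor has been closed while a dup() of it is still open, or -1 if
 * the setup failed.
 */
static int events_after_close_with_dup(void)
{
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN }, out;

    int epfd = epoll_create1(0);
    if (epfd < 0 || pipe(pipefd) < 0)
        return -1;

    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        return -1;

    int copy = dup(pipefd[0]);   /* second reference to the same open file */
    if (copy < 0)
        return -1;
    close(pipefd[0]);            /* not the last close: the file lives on */

    if (write(pipefd[1], "x", 1) != 1)
        return -1;

    int n = epoll_wait(epfd, &out, 1, 100);

    close(copy);
    close(pipefd[1]);
    close(epfd);
    return n;
}
```

If the registration survives the close() as described, the function returns
1 even though the descriptor that was passed to epoll_ctl() is gone.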

* * *

While I understand that the current semantics of sharing epoll fd
across a fork is a consequence of its design, it is inconvenient in
my case. I have to epoll_create again and reregister all descriptors
after a fork, in order for the epoll sets in the two processes to be
independent.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 13:41   ` Marcin 'Qrczak' Kowalczyk
@ 2005-01-17 14:33     ` Jamie Lokier
  2005-01-17 14:43       ` Jamie Lokier
  2005-01-17 16:18       ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 2 replies; 8+ messages in thread
From: Jamie Lokier @ 2005-01-17 14:33 UTC (permalink / raw)
  To: linux-fsdevel

Marcin 'Qrczak' Kowalczyk wrote:
> Is it documented?

Only on linux-kernel some time ago :)

> ftp://ftp.win.tue.nl/pub/home/aeb/linux-local/manpages/man-pages-1.70.tar.gz
> doesn't seem to say that the timeout is interpreted differently for
> poll and epoll.

It doesn't say anything about the difference between poll and select either.

> Will adding 1ms be enough? In other words is epoll supposed to wait
> for some period of time which, when rounded *up* to milliseconds, will
> be >= the requested timeout? As contrasted to poll which waits at
> least the requested timeout - this behaviour is specified by SUSv3.

If you add 1 to the timeout argument to epoll_wait(), you will get the
same behaviour as poll(), because that's what the kernel's poll()
function does internally.

The behaviour is surely quite specific to Linux, though.  If some
other OS implements epoll, it may not have the same timeout behaviour.

> I can't observe the semantics of the timeout in select because it's in
> microseconds, and a gettimeofday call takes about 2us here. SUSv3 says
> that it should wait at least the requested time (except that if the
> timeout is longer than a maximum supported timeout, which must be at
> least 31 days, then it is allowed to wait shorter). So if select works
> like epoll (can wait up to 1us shorter than the requested timeout),
> it's not conforming to SUSv3.

If you call select with { 0, 10000 } - that is, 10 milliseconds, then
you get a delay between 0ms and 10ms on a 100Hz kernel.

That is easy to measure.  Just call select() in a loop and observe the
times.

The man page for select says the timeout serves as an upper bound.

But the man page is wrong too: { 0, 10100 } - that is, 10.1
milliseconds, results in a delay between 10ms and 20ms on a 100Hz
kernel.

By the way, select(), poll() and epoll_wait() all have another bug: if
the timeout parameter is too large, they'll wait *indefinitely*.  They
call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just
calls schedule() with no timer.

In practice, for portable programs which need to time application
events as accurately or low-jitter as possible (even a simple X game
"snake" needed this for the animation to look smooth), such
applications must measure the typical select/poll lateness / earliness
and adapt to the OS-specific behaviour. :(

> > This isn't just a problem for programs doing low jitter work.  Many
> > programs call select/poll/epoll, and then call gettimeofday() after to
> > decide whether the next "timer" application event is ready to be
> > serviced, or whether to call select/poll/epoll again.
> 
> This is exactly my case. I noticed that it often finishes a little
> before the requested time, and then my program epolls again for 1ms.

I agree this is unwanted.  But the obvious fix to this, which is to
make epoll behave like poll(), prevents a useful behaviour when you
want the finest available timer resolution, i.e. waiting until the
next tick.  It is silly to be forced to use select() for that case,
especially when you might be waiting for fds at the same time.

> > With the poll() behaviour, if a previous poll() finished _just_
> > before the timer event is ready, the application will call poll()
> > again with timeout 1, and then it will wait 10-20ms (on a 100 Hz
> > kernel) instead of the far more desirable 0-10ms.
> 
> Well, if the kernel measured the delay more accurately than to a clock
> tick, it could notice that a requested 1ms would be satisfied by, say,
> 8ms which remained from the current tick.

I agree 100%!  That's a good solution.

If select/poll/epoll were implemented by the kernel reading the
current time accurately before deciding how many ticks to wait for,
they could satisfy SUSv3's constraint, _and_ allow the useful
behaviour of application events at the tick rate, _and_ reduce the
number of system calls in some programs which call select().

If you want to change the code in fs/select.c and fs/eventpoll.c to do
this, please do so; I'll be happy to support the case for it.

> There is another point where the man page is misleading: it says that
> closing a fd will automatically unregister it from epoll sets. In
> reality it is unregistered only when the underlying file structure is
> released.

Yes, it means the last close.

> While I understand that the current semantics of sharing epoll fd
> across a fork is a consequence of its design, it is inconvenient in
> my case. I have to epoll_create again and reregister all descriptors
> after a fork, in order for the epoll sets in the two processes to be
> independent.

Fortunately, what you are doing is quite rare.  Usually after fork,
one wants to monitor a different set of fds anyway.

-- Jamie


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 14:33     ` Jamie Lokier
@ 2005-01-17 14:43       ` Jamie Lokier
  2005-01-17 16:18       ` Marcin 'Qrczak' Kowalczyk
  1 sibling, 0 replies; 8+ messages in thread
From: Jamie Lokier @ 2005-01-17 14:43 UTC (permalink / raw)
  To: linux-fsdevel

Jamie Lokier wrote:
> > Well, if the kernel measured the delay more accurately than to a clock
> > tick, it could notice that a requested 1ms would be satisfied by, say,
> > 8ms which remained from the current tick.
> 
> I agree 100%!  That's a good solution.
> 
> If select/poll/epoll were implemented by the kernel reading the
> current time accurately before deciding how many ticks to wait for,
> they could satisfy SUSv3's constraint, _and_ allow the useful
> behaviour of application events at the tick rate, _and_ reduce the
> number of system calls in some programs which call select().
> 
> If you want to change the code in fs/select.c and fs/eventpoll.c to do
> this, please do so; I'll be happy to support the case for it.

That said, _any_ change to select/poll is sure to break some programs,
which depend on the current quirks.

By the way, the most logically useful interface would take an
*absolute* end time, in any of the forms that the POSIX timer code allows.

This is because nearly every application which calls select/poll/epoll
with a timeout first calls gettimeofday() and then computes the
difference between an absolute time and the current clock time to pass
as the timeout argument - introducing a race condition in the process.

Giving an absolute time would eliminate the race condition _and_ all
calls to the microsecond timer, which is often quite slow.

However, that would require adding yet another non-standard interface.
Perhaps a case can be made for epoll_wait_abstime or something like
that, seeing as epoll is quite non-standard anyway.
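
For reference, this is roughly the conversion that applications do in user
space today. It is a sketch with a hypothetical helper name,
timeout_ms_until(); it is not the proposed kernel interface, and the race
between reading the clock and entering the system call remains:

```c
#include <limits.h>
#include <time.h>

/*
 * Convert an absolute deadline into a millisecond timeout relative to
 * *now, rounding up so the caller never asks for less than what is left.
 * Returns 0 if the deadline has already passed; clamps to INT_MAX.
 */
static int timeout_ms_until(const struct timespec *now,
                            const struct timespec *deadline)
{
    long long ns = (long long)(deadline->tv_sec - now->tv_sec) * 1000000000LL
                 + (deadline->tv_nsec - now->tv_nsec);
    if (ns <= 0)
        return 0;

    long long ms = (ns + 999999) / 1000000;   /* round up to whole ms */
    return ms > INT_MAX ? INT_MAX : (int)ms;
}
```

The result would be passed as the timeout argument of poll() or
epoll_wait(); "now" has to be read immediately before the call, which is
exactly the extra clock read an absolute-time interface would avoid.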

-- Jamie


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 14:33     ` Jamie Lokier
  2005-01-17 14:43       ` Jamie Lokier
@ 2005-01-17 16:18       ` Marcin 'Qrczak' Kowalczyk
  2005-01-17 16:48         ` Jamie Lokier
  1 sibling, 1 reply; 8+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2005-01-17 16:18 UTC (permalink / raw)
  To: linux-fsdevel

Jamie Lokier <jamie@shareable.org> writes:

> If you call select with { 0, 10000 } - that is, 10 milliseconds, then
> you get a delay between 0ms and 10ms on a 100Hz kernel.
>
> That is easy to measure.  Just call select() in a loop and observe the
> times.

I think HZ is now 1000 on x86, so I can't determine experimentally
which 1ms shifts come from the resolution of poll/epoll interface and
which come from the timer frequency.

What you are saying implies that the amount by which select/epoll may
shorten the timeout depends on the timer frequency. Thus a program
which does not know the timer frequency can't know how much to make
the timeout longer, without risking having to sleep again.

I can observe that select rounds the timeout up to a multiple of 1ms,
and then waits for an amount between the resulting time and 1ms
shorter. If the timeout is 12.5ms, it will wait between 12ms and 13ms;
the same is true for any requested timeout > 12ms and <= 13ms.

I guess the 1ms here is actually the timer tick and that in case of
epoll rules are the same, except that the timeout is specified in ms.
That is, it is rounded up to a multiple of timer ticks, and then the
actual timeout is between 0 and 1 tick shorter, such that it ends at
some tick. Right?

This means that depending on the fraction of the current tick which
has elapsed, and the fraction of the timeout we want to sleep, the
optimal request may have two possible values. By optimal request I
mean the one which will give us the shortest delay which is not
shorter than the one we actually want. We don't know which request
to give if we don't know the timer frequency.

For example, assuming 100Hz clock, if we are 3.333ms after a tick
and we want to sleep at least for 124ms, we should give some timeout
between 121ms and 130ms. We will actually sleep 126.667ms, which is
fine. But if we are 6.666ms after a tick and we want to sleep at
least for 126ms, we should give some timeout between 131ms and 140ms.
This will give us an actual delay of 133.334ms - one tick earlier
would be too short.
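
The arithmetic in this example can be checked with a small model. It is
only a sketch of the rule guessed above (round the timeout up to whole
ticks, end the wait on a tick boundary), not kernel code, and the function
name is made up:

```c
#include <assert.h>

/*
 * Modelled sleep length in microseconds when timeout_us is requested
 * since_tick_us after the last tick, with a tick of tick_us: the timeout
 * is rounded up to whole ticks and the wait ends on a tick boundary, so
 * the part of the current tick that has already elapsed is lost.
 */
static long modelled_sleep_us(long timeout_us, long tick_us, long since_tick_us)
{
    assert(tick_us > 0 && since_tick_us >= 0 && since_tick_us < tick_us);
    long ticks = (timeout_us + tick_us - 1) / tick_us;   /* round up */
    return ticks * tick_us - since_tick_us;
}
```

With a 10000us tick, modelled_sleep_us(121000, 10000, 3333) and
modelled_sleep_us(130000, 10000, 3333) both give 126667, while
modelled_sleep_us(131000, 10000, 6666) gives 133334 and
modelled_sleep_us(130000, 10000, 6666) gives 123334, which is below the
126ms wanted in the second case.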

So perhaps my program should indeed do what it currently does.
Sometimes the actual delay will be too short and a separate epoll call
will sleep the remaining tick. But if the program always added one
tick, it would sometimes sleep one tick longer than necessary (and
another problem would be that it does not know the timer frequency,
so it can't add one tick).

I think this gives the optimal behavior wrt. the number of ticks to
sleep, and the only disadvantage is more syscalls in some cases.

The kernel could make this better because it knows the timer frequency
and it can determine the fraction of the current tick: it would make
the delay longer by 1 tick in case the rest of the current tick is
shorter than the fractional part of the delay (wrt. a tick) reduced by
the unit of resolution of the interface (to allow for 1 unit to mean
"until the next tick").

But it would help only in case the tick is longer than the resolution
of the epoll interface, so perhaps it's not worth the effort - I think
today it's usually 1ms, equal to the epoll resolution. With select it
would help more than with epoll, because the select interface has a
finer resolution, but OTOH select is old-fashioned. And it would only
help in saving some syscalls, it would not provide a behavior which
is unimplementable today.

> The man page for select says the timeout serves as an upper bound.

Well, because of other processes a timeout can always become longer
than requested. It should be an upper bound in the sense that it will
return earlier if a fd is ready. But it should not return earlier if
fds are not ready, pretending that the timeout expired while in fact
it did not.

Except that, as you say, it would prevent specifying a timeout
"until the next tick, even if it's shorter than the resolution
of the interface"...

> By the way, select(), poll() and epoll_wait() all have another bug: if
> the timeout parameter is too large, they'll wait *indefinitely*.  They
> call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just
> calls schedule() with no timer.

Oops. But this should be easy to fix: give MAX_SCHEDULE_TIMEOUT-1.
It's LONG_MAX ticks, i.e. 24 days on a 32-bit machine with 1000Hz
timer?
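
The 24-day figure can be checked directly. A tiny sketch, assuming a 32-bit
long so that LONG_MAX equals INT32_MAX; the helper name is made up:

```c
#include <stdint.h>

/* Days covered by max_ticks timer ticks at hz ticks per second. */
static double ticks_to_days(int32_t max_ticks, int hz)
{
    return (double)max_ticks / hz / 86400.0;
}
```

ticks_to_days(INT32_MAX, 1000) comes out just under 24.9 days; at 100Hz
the same tick count covers ten times as long.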

> If select/poll/epoll were implemented by the kernel reading the
> current time accurately before deciding how many ticks to wait for,
> they could satisfy SUSv3's constraint, _and_ allow the useful
> behaviour of application events at the tick rate, _and_ reduce the
> number of system calls in some programs which call select().

Right.

> If you want to change the code in fs/select.c and fs/eventpoll.c to
> do this, please do so; I'll be happy to support the case for it.

I'm still not sure what the behavior should be. It seems poll and
epoll with their current interfaces can't be made better if the tick
frequency is 1000Hz...

> By the way, the most logically useful interface would take an
> *absolute* end time, in any of the forms that the POSIX timer code
> allows.

Yes! I actually have an absolute time, and compute a timeout from it.

Even if a user of my language specifies a relative time, I convert it
to absolute time first. Then it's converted to relative time in order
to pass the timeout to epoll/poll/select, and then the kernel probably
converts it to absolute time again.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 16:18       ` Marcin 'Qrczak' Kowalczyk
@ 2005-01-17 16:48         ` Jamie Lokier
  2005-01-18 23:27           ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 1 reply; 8+ messages in thread
From: Jamie Lokier @ 2005-01-17 16:48 UTC (permalink / raw)
  To: linux-fsdevel

Marcin 'Qrczak' Kowalczyk wrote:
> > If you call select with { 0, 10000 } - that is, 10 milliseconds, then
> > you get a delay between 0ms and 10ms on a 100Hz kernel.
> >
> > That is easy to measure.  Just call select() in a loop and observe the
> > times.
> 
> I think HZ is now 1000 on x86, so I can't determine experimentally
> which 1ms shifts come from the resolution of poll/epoll interface and
> which come from the timer frequency.

Sure you can.  Call select(0,0,0,0,{0,1000}) in a loop, and you'll
see it returns every 1ms.

Call poll(0,0,1) and you'll see it returns every 2ms.

Call epoll_wait (epfd,events,1,1) and you'll see it returns every 1ms.
(maxevents must be greater than zero, otherwise the call fails with EINVAL.)

That 2ms minimum wait with poll() is the problem, although it was more
of a problem when HZ was 100 (as it still is on some architectures).

> What you are saying implies that the amount by which select/epoll may
> shorten the timeout depends on the timer frequency. Thus a program
> which does not know the timer frequency can't know how much to make
> the timeout longer, without risking having to sleep again.

Correct.

Another way to look at it is that a program which doesn't know the
timer frequency can't know how much to make the poll() timeout shorter
if it wants shortest non-zero timeouts, or if it is trying to time an
event as close as possible to an absolute time.

In my experience with a simple interactive X video game, the only
portable way to do this is to actually measure the times at which
select/poll return and deduce the OS's granularity and late/early
rounding using some kind of control system estimator.  (Then if you
need finer granularity (as the game did) you need a busy loop to add
the remaining sub-tick time).

> I guess the 1ms here is actually the timer tick and that in case of
> epoll rules are the same, except that the timeout is specified in ms.
> That is, it is rounded up to a multiple of timer ticks, and then the
> actual timeout is between 0 and 1 tick shorter, such that it ends at
> some tick. Right?

Yes.  Look at the kernel code: the only difference between epoll and
poll is that poll has "+1" at the end of the equation for the number
of ticks to wait.
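
As a sketch of that difference (a model of the rounding as described in
this thread, with hypothetical function names, not a copy of the kernel
source):

```c
#include <assert.h>

/* Ticks to wait for a millisecond timeout, rounded up to whole ticks. */
static long epoll_ticks(long timeout_ms, long hz)
{
    assert(timeout_ms >= 0 && hz > 0);
    return (timeout_ms * hz + 999) / 1000;
}

/* The same computation with the extra "+1" that poll() is said to add. */
static long poll_ticks(long timeout_ms, long hz)
{
    return epoll_ticks(timeout_ms, hz) + 1;
}
```

At 100Hz a 10ms timeout gives 1 tick for epoll and 2 for poll; at 1000Hz a
1ms timeout likewise gives 1 and 2, which matches the 1ms and 2ms periods
mentioned above.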

> This means that depending on the fraction of the current tick which
> has elapsed, and the fraction of the timeout we want to sleep, the
> optimal request may have two possible values. By optimal request I
> mean the one which will give us the shortest delay which is not
> shorter than the one we actually want. We don't know which request
> to give if we don't know the timer frequency.
> 
> For example, assuming 100Hz clock, if we are 3.333ms after a tick
> and we want to sleep at least for 124ms, we should give some timeout
> between 121ms and 130ms. We will actually sleep 126.667ms, which is
> fine. But if we are 6.666ms after a tick and we want to sleep at
> least for 126ms, we should give some timeout between 131ms and 140ms.
> This will give us an actual delay of 133.334ms - one tick earlier
> would be too short.
> 
> So perhaps my program should indeed do what it currently does.
> Sometimes the actual delay will be too short and a separate epoll call
> will sleep the remaining tick. But if the program always added one
> tick, it would sometimes sleep one tick longer than necessary (and
> another problem would be that it does not know the timer frequency,
> so it can't add one tick).
> 
> I think this gives the optimal behavior wrt. the number of ticks to
> sleep, and the only disadvantage is more syscalls in some cases.
> 
> The kernel could make this better because it knows the timer frequency
> and it can determine the fraction of the current tick: it would make
> the delay longer by 1 tick in case the rest of the current tick is
> shorter than the fractional part of the delay (wrt. a tick) reduced by
> the unit of resolution of the interface (to allow for 1 unit to mean
> "until the next tick").
> 
> But it would help only in case the tick is longer than the resolution
> of the epoll interface, so perhaps it's not worth the effort - I think
> today it's usually 1ms, equal to the epoll resolution. With select it
> would help more than with epoll, because the select interface has a
> finer resolution, but OTOH select is old-fashioned. And it would only
> help in saving some syscalls, it would not provide a behavior which
> is unimplementable today.
> 
> > The man page for select says the timeout serves as an upper bound.
> 
> Well, because of other processes a timeout can always become longer
> than requested. It should be an upper bound in the sense that it will
> return earlier if a fd is ready. But it should not return earlier if
> fds are not ready, pretending that the timeout expired while in fact
> it did not.
> 
> Except that, as you say, it would prevent specifying a timeout
> "until the next tick, even if it's shorter than the resolution
> of the interface"...
> 
> > By the way, select(), poll() and epoll_wait() all have another bug: if
> > the timeout parameter is too large, they'll wait *indefinitely*.  They
> > call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just
> > calls schedule() with no timer.
> 
> Oops. But this should be easy to fix: give MAX_SCHEDULE_TIMEOUT-1.

Yes.  Feel free to submit the patch.

> > If select/poll/epoll were implemented by the kernel reading the
> > current time accurately before deciding how many ticks to wait for,
> > they could satisfy SUSv3's constraint, _and_ allow the useful
> > behaviour of application events at the tick rate, _and_ reduce the
> > number of system calls in some programs which call select().
> 
> Right.
> 
> > If you want to change the code in fs/select.c and fs/eventpoll.c to
> > do this, please do so; I'll be happy to support the case for it.
> 
> I'm still not sure what the behavior should be. It seems poll and
> epoll with their current interfaces can't be made better if the tick
> frequency is 1000Hz...

Correct - but the tick frequency isn't 1000Hz on all architectures and
it isn't likely to change either, because 1000Hz is too fast for
slower CPUs such as in routers and PDAs.

> > By the way, the most logically useful interface would take an
> > *absolute* end time, in any of the forms that the POSIX timer code
> > allows.
> 
> Yes! I actually have an absolute time, and compute a timeout from it.

Nearly every program does.

> Even if a user of my language specifies a relative time, I convert it
> to absolute time first. Then it's converted to relative time in order
> to pass the timeout to epoll/poll/select, and then the kernel probably
> converts it to absolute time again.

Quite.  Silly interface, isn't it? :)  We even get to waste significant
numbers of cycles reading the timer chip every time.  2 microseconds
on your system, ~20 microseconds on some others.

-- Jamie


* Re: Bug: epoll_wait timeout is shorter than requested
  2005-01-17 16:48         ` Jamie Lokier
@ 2005-01-18 23:27           ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 0 replies; 8+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2005-01-18 23:27 UTC (permalink / raw)
  To: linux-fsdevel

Jamie Lokier <jamie@shareable.org> writes:

> In my experience with a simple interactive X video game, the only
> portable way to do this is to actually measure the times at which
> select/poll return and deduce the OS's granularity and late/early
> rounding using some kind of control system estimator.  (Then if you
> need finer granularity (as the game did) you need a busy loop to add
> the remaining sub-tick time).

I just implemented something like this in the runtime of my language.
It was surprisingly easy to adjust to varying behaviors of poll and
epoll.

At ./configure time I measure the time poll/epoll will wait when asked
to wait for 1ms, started just after a timer tick (previous delay).
This is the only tricky part, because other activity in the system may
disturb the result, in either direction. So I do this 20 times, sort
the results, skip 3 shortest and 7 longest delays, and expect that
others are equal when rounded up to milliseconds. If they are not
equal, perhaps the system is too busy to give reliable results;
I wait for 1 second and repeat the experiment, up to 50 times.
Any better idea?
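
The filtering step described above can be sketched as follows. The names
and the choice of microseconds for the samples are assumptions made for
illustration; taking the samples and retrying are left to the caller:

```c
#include <stdlib.h>

#define SAMPLES       20
#define SKIP_SHORTEST  3
#define SKIP_LONGEST   7

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/*
 * samples_us holds SAMPLES measured delays in microseconds and is sorted
 * in place. Returns the delay in whole milliseconds (rounded up) that all
 * retained samples agree on, or -1 if they disagree and the experiment
 * should be repeated.
 */
static long agreed_delay_ms(long samples_us[SAMPLES])
{
    qsort(samples_us, SAMPLES, sizeof samples_us[0], cmp_long);

    long first = (samples_us[SKIP_SHORTEST] + 999) / 1000;
    for (int i = SKIP_SHORTEST + 1; i < SAMPLES - SKIP_LONGEST; i++)
        if ((samples_us[i] + 999) / 1000 != first)
            return -1;
    return first;
}
```

Dropping more samples from the long end than from the short end reflects
that system activity is more likely to lengthen a delay than to shorten it.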

The result is in practice 1ms for epoll with 1000Hz, 10ms with 100Hz,
and twice as much for poll. It is 3ms for poll on Alpha (Linux-2.2.20),
I'm not sure why.

Anyway, the program doesn't need to know whether the actual delay can
be shorter or longer than requested, nor the clock frequency. It must
only know this one number.

When running a program, my scheduler computes the timeout for
epoll/poll/select based on the earliest wakeup time of sleeping
threads. After calling epoll/poll/select, if still no thread is ready,
the scheduler loops again.

This loop used to be a rare event:
- the system clock has been adjusted during waiting
- the timeout was longer than INT_MAX
- spurious wakeup from epoll happened (threads are not unregistered
  from the epoll fd until a spurious wakeup actually happens, because
  usually it does not happen before they would reregister anyway)
- epoll returned earlier than asked (which prompted me to report this
  as a problem here)
- poll/epoll/select failed with EINTR, yet handling the signal did not
  wake up a thread

But now I changed the rules of computation of the timeout. It is
rounded down instead of up; the measured time described above is
subtracted; 1ms is added; if it got below 0, 0 is substituted.

This lets poll/epoll return *before* the planned wakeup time instead
of after (unless the system is busy of course), and the loop will
almost always spin if some thread is about to finish sleeping. In
the next iteration the computed delay will be 0 and the loop will
degenerate to busy waiting, calling gettimeofday and poll/epoll with
timeout 0 (or just gettimeofday if there are no fds to wait for).

If the program knew the timer frequency, the semantics of the timeout,
and the current fraction of the timer tick, it could save one tick of
busy waiting by making the timeout longer by one tick in some cases.
This is not worth the effort.

The semantics of the timeout of epoll is indeed slightly more useful
than of poll: it makes the busy waiting one tick shorter in some cases.

If the measurement done at ./configure time does not apply at runtime,
the only bad things which can happen are inaccurate delays, or longer
busy waiting than needed. This is not critical, so I don't worry.

I didn't bother to perform the analogous adjustment for select,
because it is used only on systems which don't have poll.

Sleeping used to be accurate to about 1ms here, or 10ms with a 100Hz
clock. Now it's accurate to about 30us most of the time. Enough.
The busy waiting is not noticeable in CPU usage.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
