From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl>
Subject: Re: Bug: epoll_wait timeout is shorter than requested
Date: Mon, 17 Jan 2005 17:18:34 +0100
Message-ID: <87acr8uizp.fsf@qrnik.zagroda>
References: <87651wl32d.fsf@qrnik.zagroda>
	<20050117114821.GB20152@mail.shareable.org>
	<87r7kk41gp.fsf@qrnik.zagroda>
	<20050117143348.GA23427@mail.shareable.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from paf87.warszawa.sdi.tpnet.pl ([217.96.225.87]:10513 "EHLO
	qrnik.knm.org.pl") by vger.kernel.org with ESMTP id S262033AbVAQQSg
	(ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Mon, 17 Jan 2005 11:18:36 -0500
Received: from qrczak by qrnik.knm.org.pl with local (Exim 3.36 #1)
	id 1CqZaM-0006pT-00
	for linux-fsdevel@vger.kernel.org; Mon, 17 Jan 2005 17:18:34 +0100
To: linux-fsdevel@vger.kernel.org
In-Reply-To: <20050117143348.GA23427@mail.shareable.org> (Jamie Lokier's
 message of "Mon, 17 Jan 2005 14:33:48 +0000")
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Jamie Lokier <jamie@shareable.org> writes:

> If you call select with { 0, 10000 } - that is, 10 milliseconds, then
> you get a delay between 0ms and 10ms on a 100Hz kernel.
>
> That is easy to measure.  Just call select() in a loop and observe the
> times.

I think HZ is now 1000 on x86, so I can't determine experimentally
which 1ms shifts come from the resolution of poll/epoll interface and
which come from the timer frequency.

What you are saying implies that the amount by which select/epoll may
shorten the timeout depends on the timer frequency. Thus a program
which does not know the timer frequency can't know how much to make
the timeout longer, without risking having to sleep again.

I can observe that select rounds the timeout up to a multiple of 1ms,
and then waits for an amount between the resulting time and 1ms
shorter. If the timeout is 12.5ms, it will wait between 12ms and 13ms;
the same is true for any requested timeout > 12ms and <= 13ms.

I guess the 1ms here is actually the timer tick and that in case of
epoll rules are the same, except that the timeout is specified in ms.
That is, it is rounded up to a multiple of timer ticks, and then the
actual timeout is between 0 and 1 tick shorter, such that it ends at
some tick. Right?

This means that depending on the fraction of the current tick which
has elapsed, and the fraction of the timeout we want to sleep, the
optimal request may have two possible values. By optimal request I
mean the one which will give us the shortest delay which is not
shorter than the one we actually want. We don't know which request
to give if we don't know the timer frequency.

For example, assuming 100Hz clock, if we are 3.333ms after a tick
and we want to sleep at least for 124ms, we should give some timeout
between 121ms and 130ms. We will actually sleep 126.667ms, which is
fine. But if we are 6.666ms after a tick is and we want to sleep at
least for 126ms, we should give some timeout between 131ms and 140ms.
This will give us an actual delay of 133.334ms - one tick earlier
would be too short.

So perhaps my program should indeed do what it currently does.
Sometimes the actual delay will be too short and a separate epoll call
will sleep the remaining tick. But if the program always added one
tick, it would sometimes sleep one tick longer than necessary (and
another problem would be that it does know the timer frequency,
so it can't add one tick).

I think this gives the optimal behavior wrt. the number of ticks to
sleep, and the only disadvantage is more syscalls in some cases.

The kernel could make this better because it knows the timer frequency
and it can determine the fraction of the current tick: it would make
the delay longer by 1 tick in case the rest of the current tick is
shorter than the fractional part of the delay (wrt. a tick) reduced by
the unit of resolution of the interface (to allow for 1 unit to mean
"until the next tick").

But it would help only in case the tick is longer than the resolution
of the epoll interface, so perhaps it's not worth the effort - I think
today it's usually 1ms, equal to the epoll resolution. With select it
would help more than with epoll, because the select interface has a
finer resolution, but OTOH select is old-fashioned. And it would only
help in saving some syscalls, it would not provide a behavior which
is unimplementable today.

> The man page for select says the timeout serves as an upper bound.

Well, because of other processes a timeout can always become longer
than requested. It should be an upper bound in the sense that it will
return earlier if a fd is ready. But it should not return earlier if
fds are not ready, pretending that the timeout expired while in fact
it did not.

Except that, as you say, it would prevent specifying a timeout
"until the next tick, even if it's shorter than the resolution
of the interface"...

> By the way, select(), poll() and epoll_wait() all have another bug: if
> the timeout parameter is too large, they'll wait *indefinitely*.  They
> call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just
> calls schedule() with no timer.

Oops. But this should be easy to fix: give MAX_SCHEDULE_TIMEOUT-1.
It's LONG_MAX ticks, i.e. 24 days on a 32-bit machine with 1000Hz
timer?

> If select/poll/epoll were implemented by the kernel reading the
> current time accurately before deciding how many ticks to wait for,
> they could satisfy SUSv3's constraint, _and_ allow the useful
> behaviour of application events at the tick rate, _and_ reduce the
> number of system calls in some programs which call select().

Right.

> If you want to change the code in fs/select.c and fs/eventpoll.c to
> do this, please do so; I'll be happy to support the case for it.

I'm still not sure what the behavior should be. It seems poll and
epoll with their current interfaces can't be made better if the tick
frequency is 1000Hz...

> By the way, the most logically useful interface would take an
> *absolute* end time, in any of the forms that the POSIX timer code
> allows.

Yes! I actually have an absolute time, and compute a timeout from it.

Even if a user of my language specifies a relative time, I convert it
to absolute time first. Then it's converted to relative time in order
to pass the timeout to epoll/poll/select, and then the kernel probably
converts it to absolute time again.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/