From mboxrd@z Thu Jan 1 00:00:00 1970 From: Marcin 'Qrczak' Kowalczyk Subject: Re: Bug: epoll_wait timeout is shorter than requested Date: Mon, 17 Jan 2005 17:18:34 +0100 Message-ID: <87acr8uizp.fsf@qrnik.zagroda> References: <87651wl32d.fsf@qrnik.zagroda> <20050117114821.GB20152@mail.shareable.org> <87r7kk41gp.fsf@qrnik.zagroda> <20050117143348.GA23427@mail.shareable.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from paf87.warszawa.sdi.tpnet.pl ([217.96.225.87]:10513 "EHLO qrnik.knm.org.pl") by vger.kernel.org with ESMTP id S262033AbVAQQSg (ORCPT ); Mon, 17 Jan 2005 11:18:36 -0500 Received: from qrczak by qrnik.knm.org.pl with local (Exim 3.36 #1) id 1CqZaM-0006pT-00 for linux-fsdevel@vger.kernel.org; Mon, 17 Jan 2005 17:18:34 +0100 To: linux-fsdevel@vger.kernel.org In-Reply-To: <20050117143348.GA23427@mail.shareable.org> (Jamie Lokier's message of "Mon, 17 Jan 2005 14:33:48 +0000") Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org Jamie Lokier writes: > If you call select with { 0, 10000 } - that is, 10 milliseconds, then > you get a delay between 0ms and 10ms on a 100Hz kernel. > > That is easy to measure. Just call select() in a loop and observe the > times. I think HZ is now 1000 on x86, so I can't determine experimentally which 1ms shifts come from the resolution of poll/epoll interface and which come from the timer frequency. What you are saying implies that the amount by which select/epoll may shorten the timeout depends on the timer frequency. Thus a program which does not know the timer frequency can't know how much to make the timeout longer, without risking having to sleep again. I can observe that select rounds the timeout up to a multiple of 1ms, and then waits for an amount between the resulting time and 1ms shorter. If the timeout is 12.5ms, it will wait between 12ms and 13ms; the same is true for any requested timeout > 12ms and <= 13ms. I guess the 1ms here is actually the timer tick and that in case of epoll rules are the same, except that the timeout is specified in ms. That is, it is rounded up to a multiple of timer ticks, and then the actual timeout is between 0 and 1 tick shorter, such that it ends at some tick. Right? This means that depending on the fraction of the current tick which has elapsed, and the fraction of the timeout we want to sleep, the optimal request may have two possible values. By optimal request I mean the one which will give us the shortest delay which is not shorter than the one we actually want. We don't know which request to give if we don't know the timer frequency. For example, assuming 100Hz clock, if we are 3.333ms after a tick and we want to sleep at least for 124ms, we should give some timeout between 121ms and 130ms. We will actually sleep 126.667ms, which is fine. But if we are 6.666ms after a tick is and we want to sleep at least for 126ms, we should give some timeout between 131ms and 140ms. This will give us an actual delay of 133.334ms - one tick earlier would be too short. So perhaps my program should indeed do what it currently does. Sometimes the actual delay will be too short and a separate epoll call will sleep the remaining tick. But if the program always added one tick, it would sometimes sleep one tick longer than necessary (and another problem would be that it does know the timer frequency, so it can't add one tick). I think this gives the optimal behavior wrt. the number of ticks to sleep, and the only disadvantage is more syscalls in some cases. The kernel could make this better because it knows the timer frequency and it can determine the fraction of the current tick: it would make the delay longer by 1 tick in case the rest of the current tick is shorter than the fractional part of the delay (wrt. a tick) reduced by the unit of resolution of the interface (to allow for 1 unit to mean "until the next tick"). But it would help only in case the tick is longer than the resolution of the epoll interface, so perhaps it's not worth the effort - I think today it's usually 1ms, equal to the epoll resolution. With select it would help more than with epoll, because the select interface has a finer resolution, but OTOH select is old-fashioned. And it would only help in saving some syscalls, it would not provide a behavior which is unimplementable today. > The man page for select says the timeout serves as an upper bound. Well, because of other processes a timeout can always become longer than requested. It should be an upper bound in the sense that it will return earlier if a fd is ready. But it should not return earlier if fds are not ready, pretending that the timeout expired while in fact it did not. Except that, as you say, it would prevent specifying a timeout "until the next tick, even if it's shorter than the resolution of the interface"... > By the way, select(), poll() and epoll_wait() all have another bug: if > the timeout parameter is too large, they'll wait *indefinitely*. They > call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just > calls schedule() with no timer. Oops. But this should be easy to fix: give MAX_SCHEDULE_TIMEOUT-1. It's LONG_MAX ticks, i.e. 24 days on a 32-bit machine with 1000Hz timer? > If select/poll/epoll were implemented by the kernel reading the > current time accurately before deciding how many ticks to wait for, > they could satisfy SUSv3's constraint, _and_ allow the useful > behaviour of application events at the tick rate, _and_ reduce the > number of system calls in some programs which call select(). Right. > If you want to change the code in fs/select.c and fs/eventpoll.c to > do this, please do so; I'll be happy to support the case for it. I'm still not sure what the behavior should be. It seems poll and epoll with their current interfaces can't be made better if the tick frequency is 1000Hz... > By the way, the most logically useful interface would take an > *absolute* end time, in any of the forms that the POSIX timer code > allows. Yes! I actually have an absolute time, and compute a timeout from it. Even if a user of my language specifies a relative time, I convert it to absolute time first. Then it's converted to relative time in order to pass the timeout to epoll/poll/select, and then the kernel probably converts it to absolute time again. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/